Adaptive Protocols for Software Distributed Shared Memory by Amza, C. et al.
Cristiana Amza†, Alan L. Cox†, Sandhya Dwarkadas‡, LiJie Jin†, Karthick Rajamani§, and Willy Zwaenepoel†
† Department of Computer Science, Rice University
‡ Department of Computer Science, University of Rochester
§ Department of Electrical and Computer Engineering, Rice University
Abstract
We demonstrate the benefits of software shared memory protocols that adapt at run time to the memory access patterns observed in the applications. This adaptation is automatic (no user annotations are required) and does not rely on compiler support or special hardware. We investigate adaptation between single- and multiple-writer protocols, dynamic aggregation of pages into a larger transfer unit, and adaptation between invalidate and update. Our results indicate that adaptation between single- and multiple-writer and dynamic page aggregation are clearly beneficial. The results for the adaptation between invalidate and update are less compelling, showing at best gains similar to the dynamic aggregation adaptation and at worst serious performance deterioration.
I Introduction
Many different protocols have been proposed for implementing a software shared memory abstraction on distributed memory hardware. The relative performance of these protocols is application-dependent: the memory access patterns of the application determine which protocols exhibit good performance. It is therefore appealing to build a system with multiple protocols and let the system choose between the different protocols based on the access patterns it observes in the application. In this paper, we present the design of such an adaptive software distributed shared memory system and evaluate its performance.
Specifically, this paper focuses on protocols that implement the lazy release consistency (LRC) memory model. We furthermore assume that shared memory accesses are detected using virtual memory protection. This paper explores the benefits of LRC protocols that adapt to the memory access patterns of the applications by comparing their performance to non-adaptive versions of the protocols. In particular, we investigate:
• adaptation between single- and multiple-writer protocols, including adaptation to migratory access patterns,
• dynamic aggregation of pages into larger transfer units, and
• adaptation between invalidate and update protocols.
The adaptations considered in this paper are triggered automatically: the runtime system detects certain access patterns and switches between protocols accordingly. This automated adaptation distinguishes our work from so-called multi-protocol software shared memory implementations, in which the user has to annotate the program to select the appropriate protocol. In our experience, removing the need for annotation leads to much improved usability.
The adaptive protocols were implemented in TreadMarks. Our experimental platform is a switched  Mbps Ethernet consisting of eight  MHz Pentium Pro machines running FreeBSD. We use eight applications to demonstrate the performance of the adaptive protocols: 3D-FFT, CG, MG, and IS from the NAS benchmark suite; Water and Barnes-Hut from the SPLASH benchmark suite; Gauss from the TreadMarks distribution; and ILINK from the FASTLINK package. The results indicate that:
• Adaptation between single- and multiple-writer and dynamic aggregation perform well, in some cases showing substantial performance improvement, and never decreasing performance.
• Adaptation between invalidate and update is less successful, with performance improvements that match dynamic aggregation in some cases and substantial performance losses in others.
The outline of the rest of this paper is as follows. Section II presents the necessary background information about LRC. Section III presents the possible protocol choices for implementing LRC, and the policies and mechanisms by which the adaptive protocols choose between their alternatives. Section IV describes the experimental environment. Section V describes the applications used. Section VI presents the results of the performance comparison. Section VII discusses related work. Section VIII presents our conclusions.
II Programming Model
We assume an explicitly parallel programming model with primitives for process creation and destruction, synchronization, and shared memory allocation and deallocation. Synchronization primitives include mutual exclusion locks and barriers. Shared memory is accessed through load and store instructions. The memory consistency model presented to the user is release consistency (RC), a relaxed memory model.
In RC ordinary shared memory accesses are distin
guished from synchronization accesses with the latter cat
egory subdivided into acquire and release accesses Lock
synchronization maps onto acquires and releases in the ob
vious way a lock operation corresponds to an acquire
and an unlock corresponds to a release With barriers a
 barrier arrival corresponds to a release whereas a barrier
departure corresponds to an acquire Roughly speaking
RC requires that before a release by a processor p becomes
visible to another processor q all ordinary shared memory
modications by processor p become visible to processor
q The Lazy Release Consistency 	LRC
 algorithm 
one of the possible RC implementations delays the prop
agation of shared memory modications by processor p to
processor q until q executes an acquire corresponding to a
release by p
Programs without data races, i.e., programs with sufficient synchronization such that any pair of conflicting memory accesses is separated by a release-acquire pair, produce the same results on an RC or an LRC memory system as on a conventional sequentially consistent memory system. Performance, however, can be much improved by the use of RC or LRC, especially for software implementations of shared memory, because the messages propagating the shared memory modifications can be delayed and coalesced with the synchronization messages, leading to a substantial reduction in communication. In addition to being data-race-free, the program must perform all synchronization through the primitives supplied by the runtime system, so that the system can take the required consistency actions at synchronization points.
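The lazy propagation described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the TreadMarks implementation: the class name `LazyReleaseMemory`, the global write counter, and the per-lock log are all hypothetical, and the log replay stands in for the write notices and timestamps a real system would use.

```python
from collections import defaultdict


class LazyReleaseMemory:
    """Toy model of lazy release consistency (illustrative only)."""

    def __init__(self, num_procs):
        self.view = [dict() for _ in range(num_procs)]  # each processor's local copy
        self.known = [[] for _ in range(num_procs)]     # writes known to each processor, in order
        self.seen = [set() for _ in range(num_procs)]   # ids of writes already applied locally
        self.lock_log = defaultdict(list)               # writes left behind at the last release
        self.counter = 0                                # hypothetical global write id

    def write(self, p, addr, value):
        """An ordinary store: it changes only p's local copy."""
        self.counter += 1
        record = (self.counter, addr, value)
        self.view[p][addr] = value
        self.known[p].append(record)
        self.seen[p].add(self.counter)

    def read(self, p, addr):
        return self.view[p].get(addr)

    def release(self, p, lock):
        """Nothing is pushed to other processors at release time."""
        self.lock_log[lock] = list(self.known[p])

    def acquire(self, q, lock):
        """Modifications propagate only here, and only to the acquirer."""
        for record in self.lock_log[lock]:
            write_id, addr, value = record
            if write_id not in self.seen[q]:
                self.view[q][addr] = value
                self.known[q].append(record)
                self.seen[q].add(write_id)
```

In this model, a write by processor 0 stays invisible to processor 1 after the release and becomes visible only once processor 1 acquires the same lock, which is the behavior that distinguishes LRC from an eager implementation of RC.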
We assume that the shared memory is implemented as a global virtual memory segment shared by all processors. The virtual memory protection hardware is used to detect access to individual pages. Some of these accesses may cause page faults, which then trigger protocol operations, as described in the next section.
III Protocols
A Single vs MultipleWriter Protocols
A The Basic Protocols and Their Tradeos
In a singlewriter protocol there is a single writable copy
of a page at any given time  The processor currently
holding the writable copy of a page is called the owner of
that page Several readonly copies of the page may co
exist with the writable owner copy According to the de
nition of RC a readonly copy may be temporarily incon
sistent with the writable copy but it must be brought up
todate when the processor on which it resides synchronizes
with the owner Assume for instance that an invalidate
protocol is used and synchronization is by means of a bar
rier If the owner has modied a particular page it creates
an owner write notice for that page containing its proces
sor id and a version number The barrier protocol causes
the owner write notice to be transmitted to the processors
which have readonly copies of the page and these proces
sors then invalidate their copies On a subsequent access
miss they retrieve the page from the owner The owner
writeprotects his copy during the rst retrieval Before a
processor may write on a page it must obtain ownership
from the current owner The owner is located by means of
the write notice with the highest version number possibly
by forwarding if ownership has changed since this write no
tice was received Once ownership is obtained the pages
version number is incremented by one
In contrast in a multiplewriter protocol there may be
several writable copies of a page on dierent processors 
Each processor with a writable copy records its own mod
ications to the page by a technique called twinning and
ding Pages are initially writeprotected so that the rst
write access to a page causes a protection violation At this
point the system makes a copy of the page the twin and
unprotects the original page To detect what modications
have been made to a page the current copy is compared
wordbyword to the twin and a record of the modica
tions the di is constructed Continuing the above ex
ample when an invalidate protocol is used and synchro
nization is by means of barriers each processor that has
modied a page constructs a write notice for that page
which is forwarded by the barrier protocol to all processors
with copies of that page

A processor might receive several
write notices for a single page These write notices cause
the page to be invalidated On an access miss the dis
corresponding to these write notices have to be retrieved
and applied to the processors current copy of the page
The tradeoff between single- and multiple-writer protocols depends on the access pattern to the page and affects both execution time and memory overhead. If multiple processors write concurrently to different parts of a page (write-write false sharing), then multiple-writer protocols achieve better performance, because they do not incur the cost of transferring the page over the network to the next writer. Even if there is only a single writer, it may be advantageous to use twinning and diffing. This scenario occurs when the writer modifies only a small portion of the page. The multiple-writer protocol transmits only those modifications, while a single-writer protocol transmits the entire page.
If however only a single processor writes to a page at
any given time and this processor modies a large part
of the page then the singlewriter protocol avoids the cost
of twinning ding and di application without much in
crease in communication More importantly it avoids a
pitfall of the multiplewriter protocol called di accumu
lation  a scenario in which a number of partially or
completely overlapping dis are transmitted signicantly
increasing the amount of communication While it is possi
ble to modify the multiplewriter protocol to eliminate the
overlap there is a high computational cost to pruning use
less data from older dis each time a new di is created It
is more ecient to manage the page in singlewriter mode
Finally while the memory overhead for the singlewriter
protocol is negligible the multiplewriter protocol has to
allocate memory for the twins and the dis This extra
overhead may cause an application to page to disk with a
multiplewriter protocol while running in memory with a
 
The information in the write notices of the multiplewriter protocol
is more complicated than the version number present in the owner
write notices of the singlewriter protocol In particular it contains a
vector timestamp that allows the write notice to be partially ordered
wrt write notices from other processors
singlewriter protocol
A Adapting between Single and MultipleWriters
In the adaptive protocol used in this paper, all pages start out in multiple-writer mode. A page may switch to single-writer mode by one of two events:
• A processor receives a diff request for a page, and it has modified the entire page. In this case, the page is clearly single-writer, and there is no reduction in communication by sending a diff.
• A processor sends out diff requests for a page, it receives no concurrent diffs, and the sum of the sizes of the diffs received is bigger than the page size. This is indicative of the diff accumulation phenomenon discussed earlier. Since there are no concurrent diffs, there is no write-write false sharing. Looking ahead to the time when a different processor requests the diffs for this page, keeping the page in multiple-writer mode would cause more data to be sent than a page. It is therefore more efficient to put the page in single-writer mode.
A page may switch back to multiple-writer mode at the onset of write-write false sharing, which is detected by the ownership refusal protocol, a modification to the single-writer protocol for locating and transferring ownership. On a release (an unlock or a barrier arrival), a processor communicates both its owner write notices and its multiple-writer write notices. On a write fault to a page in single-writer mode, a processor requests ownership, as in the single-writer protocol. The owner is located using the owner write notice with the highest version number. This version number is included in the ownership request message. If the recipient of the message is no longer the owner, or if the version number has changed, write-write false sharing has been detected: the ownership request is refused and the page is put into multiple-writer mode. Otherwise, ownership is granted, the old owner write-protects its copy of the page, the requester becomes the new owner, the version number is incremented, and the page stays in single-writer mode.
The essential aspect that needs to be understood about this protocol is that it correctly detects the presence or absence of write-write false sharing. Consider the example of a data item protected by a lock, and assume that there is no write-write false sharing on the page containing that data item. When processor p acquires the lock, it receives the owner write notice from the previous owner q, with version number V. When p writes on the page, it incurs a page fault, and it tries to obtain ownership. It sends an ownership message to q, including the version number V. By our assumption that there is no write-write false sharing on the page, no other processor has attempted to write on the page, and therefore q is still the owner and the page's version number is still V. Therefore, the ownership is granted, and p becomes the new owner. Consider next the case where there is write-write false sharing on the page, because either q or some other processor wrote on a different part of the page. If q wrote to the page, it must have re-acquired ownership of the page, and thus it must have incremented the version number. If a different processor wrote to the page, it must have acquired ownership, and q is no longer the owner. In either case, p's ownership request is refused, and the page is put in multiple-writer mode. For a more detailed description and a correctness argument, we refer the reader to Amza et al.
A Adapting to Migratory Access
Adaptation to migratory access only makes sense in the context of an adaptive protocol operating in single-writer mode (or a single-writer protocol), where its purpose is to eliminate the need for explicit ownership messages. Compared to the base multiple-writer protocol, the adaptive protocol requires an extra message to acquire ownership in the following scenario. A processor takes a read fault on an invalid page, obtains the diffs to validate the page, and then later takes a write fault on the page. With the base multiple-writer protocol, a twin is created, but no messages are sent at the time of the write fault. With the adaptive protocol, an ownership request is sent. The scenario described is that of a migratory access pattern: a sequence of reads followed by a sequence of writes by one processor, with no intervening accesses by other processors.
Detecting migratory access and eliminating the explicit ownership message is straightforward. If a page is migratory, when a processor performs its first read from the page, it will fault because the page is invalid. Its request for the page will go to the processor that still owns the page. If that processor accessed the page in a similar migratory fashion, it will preemptively send ownership along with the page. Later, if the page changes access pattern, for example to producer-consumer, the overhead to switch will be one ownership request.
B Adaptive Runtime Aggregation of Pages
B The Basic Protocols and Their Tradeos
Software DSM systems based on virtual memory techniques traditionally use the hardware page as the unit of access detection and as the unit of transfer. The single-writer, multiple-writer, and adaptive protocols discussed in Section III-A all follow this approach. Depending on whether a single- or multiple-writer protocol is used, a diff or a whole page is transferred, but in both cases access detection is done on a per-page basis, and the data transferred in a page fault response always pertains to a single hardware page. For simplicity, the discussion in this section is cast in terms of the multiple-writer protocol, unless otherwise noted, but it can easily be extended to the single-writer protocol and the adaptive single-writer/multiple-writer protocol described in Section III-A.
Both the unit of access detection and the unit of transfer can be increased, for instance by using a multiple of the hardware page size. Doing so trades off aggregation against the potential for increased false sharing. Aggregation reduces the number of messages exchanged. If a processor accesses several pages in succession, a single page fault request and reply now suffice, where before multiple exchanges were required. As a secondary benefit, the number of page faults is also reduced. These gains, however, come at the expense of potentially increased false sharing. False sharing may lead to an increase in the amount of data exchanged. Assume, for instance, that processor p writes to successive pages a and b, and processor q accesses only a. With the base page size, only the diffs for a are transferred, but if the page size is doubled, the diffs for a and b are transferred. Worse, false sharing may also lead to an increase in the number of messages. If processor p writes a, processor q writes b, and processor r reads a, two message exchanges occur with a doubled page size, one between p and r and one between q and r, where an exchange between p and r sufficed with the base page size. The effects of false sharing are aggravated under the single-writer protocol, causing more and larger page transfers. Under the adaptive single-writer/multiple-writer protocol, the larger page may be put in multiple-writer mode, while the individual hardware pages could have been handled in single-writer mode.
B The Adaptive Protocol
In this section we present a protocol that continues to
use the hardware page as the unit of detection but adap
tively coalesces pages into page groups for the purpose of
transfer The algorithm monitors the access patterns on
each processor and tries to construct page groups so as to
increase aggregation without incurring the harmful eects
of false sharing
The diffs for all of the pages in a group are requested at the first fault on any page that is a member of the group. Requests addressed to the same processor are combined into one message, resulting in fewer request messages and enabling the data transfer to occur in one message as well. Even if the diffs must come from different processors, there is still an advantage to requesting the diffs for all pages in the group at once, because those processors can return the diffs in parallel rather than in sequence.
A processor uses two different mechanisms for grouping pages. The first mechanism is based on the past accesses on that processor itself. Essentially, the processor groups pages that were accessed during the previous synchronization interval. In order to avoid packet loss in the network, the implementation limits the maximum number of pages in a single group to eight. Thus, more than one group may be formed at a synchronization point. If two or more groups are formed, the pages are assigned to groups in the order they were accessed. The second mechanism is based on past accesses of other processors. It comes into play only if the first mechanism did not produce a group for the missing page. The faulting processor checks if the page was modified by a single processor during the previous synchronization interval, and if so, it requests from that processor any contiguous pages that were modified during that interval. Again, the number of pages in any group is limited to eight.
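The two grouping mechanisms can be sketched as follows. The function names are hypothetical, pages are modeled as integers, and the second function assumes that "contiguous" means the run of modified pages adjacent to the missing page in either direction, which the text does not spell out.

```python
MAX_GROUP = 8  # group size limit stated in the text


def group_by_own_accesses(accessed_in_order, max_group=MAX_GROUP):
    """First mechanism: split the pages this processor accessed during the
    previous synchronization interval into groups, in access order."""
    return [accessed_in_order[i:i + max_group]
            for i in range(0, len(accessed_in_order), max_group)]


def group_by_writer(missing_page, modified_by_writer, max_group=MAX_GROUP):
    """Second mechanism: gather pages contiguous with the missing page that the
    same single writer modified during the previous interval."""
    group = [missing_page]
    nxt = missing_page + 1
    while nxt in modified_by_writer and len(group) < max_group:
        group.append(nxt)
        nxt += 1
    prev = missing_page - 1
    while prev in modified_by_writer and len(group) < max_group:
        group.insert(0, prev)
        prev -= 1
    return group
```

Ten pages accessed in one interval therefore yield one group of eight and one group of two, and a fault on page 5 whose writer modified pages 3, 4, 5, 6, and 9 produces the group 3 through 6.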
In order to allow the membership of a group to change over time, the algorithm keeps every page invalid until the first access to that page occurs. Thus, a page may be kept invalid even though it has been updated by an access to another page within the same group. When the page fault handler is triggered by an access to such a page, it can simply change the page's state to valid without requesting any data. In this case, the page will remain a part of its group. If, however, the page is never accessed, it will be dropped from the group at the next synchronization point. Hence, this strategy allows the algorithm to adapt to any change in the program's access pattern over the course of its execution.
C Invalidate vs Update
C The Basic Protocols and Their Tradeos
In an invalidate protocol, a page is invalidated when the processor becomes aware of a remote modification. In LRC, this happens at the time of a synchronization. A synchronization message, for instance a lock grant or a barrier departure message, contains a number of (owner) write notices indicating which pages have been modified. When the processor later accesses one of these pages, it incurs an access miss. Depending on whether a single- or multiple-writer protocol is in use, either the whole page or the diffs are fetched. In an update protocol, instead, the modifications to the page are sent with the synchronization message. Pages are never invalidated.
The tradeoffs between invalidate and update protocols are well known. Update protocols send substantially more data, including data that the processor may never access or that may be overwritten by newer data before the processor accesses the data originally sent. Invalidate protocols only retrieve the data for the pages the processor accesses, but they pay the penalty of the access miss fault and the round-trip latency to get the modifications. In addition, in release-consistent software DSM, update protocols naturally include aggregation: when a processor modifies several pages, all the modifications are sent in a single message to the other processor(s).
C The Adaptive Protocol
The adaptive invalidateupdate protocol updates the
pages that the processor is expected to access and inval
idates the other pages As with the aggregation for invali
date protocols described in Section IIIB we limit a single
update message to contain data for no more than eight
pages in order to avoid packet loss in the network Predic
tion of future accesses may be done in a variety of ways For
programs based on barriers each processor p records the
set of processors from which it receives a page fault request
for a particular page When p arrives at the next barrier if
it has modied a particular page it sends updates for that
page to the processors in the set it has computed during
the interval before the barrier  These processors re
turn negative acknowledgements to these updates if they
receive a second update for a page and have not accessed
the page since the rst update For data protected by a
lock we use the method proposed by among others Mon
nerat and Bianchini  and Speight and Bennett  We
Application Data size Sync Time
	sec

Water  molecules bl 
Barnes K bodies b 
IS   b 
DFFT    b 
MG    b 
CG   	sparse
 b 
Gauss   b 
ILINK CLP b 
TABLE I
Applications input data sets synchronization llocks
bbarriers and sequential execution time
record which pages a processor modies while it holds the
lock Updates for these pages are sent to the next acquirer
of the lock while any other modied pages are invalidated
IV Experimental Environment
Our experimental platform is a network of eight  MHz Pentium Pros running FreeBSD. Each machine has a  K byte secondary cache and a  M byte memory. The hardware page size is  K bytes. The network connecting the machines is a switched, full-duplex  Mbps Ethernet. TreadMarks uses the UDP/IP protocol for interprocessor communication. The round-trip latency for a  byte message using the UDP/IP protocol is  microseconds on this platform. The time to acquire a lock varies from  to  microseconds. The time for an eight-processor barrier is  microseconds. The time to obtain a diff varies from  to  microseconds.
V Applications
We use eight applications in this study. Water and Barnes-Hut come from the SPLASH benchmark suite. Integer Sort (IS), 3D-FFT, Multigrid (MG), and Conjugate Gradient (CG) come from the NAS benchmark suite. Gauss is a Gaussian elimination kernel distributed with TreadMarks. ILINK is part of the FASTLINK package of genetic linkage analysis programs.
Table I summarizes the relevant characteristics of the applications. It includes, for each application, the data set size used, the method of synchronization (locks, barriers, or both), and the sequential running times. Sequential running times were obtained by removing all synchronization from the TreadMarks programs; these times were used as the basis for the speedup figures reported later in the paper.
VI Results
For each of the applications, we show speedups under the following scenarios:
• the single- and multiple-writer protocols and the adaptive single-writer/multiple-writer protocol,
• the adaptive single-writer/multiple-writer protocol plus dynamic aggregation, and
• the adaptive single-writer/multiple-writer protocol plus invalidate/update adaptation, including aggregation of the updates.

[Fig. 1. Speedup comparison: single-writer, multiple-writer, and adaptive protocols. Series: SW, MW, Adapt SW/MW. Applications: Water, Barnes, FFT, IS, MG, CG, Gauss, Ilink. Vertical axis: speedup, 0.00 to 6.00.]
The effects of dynamic aggregation are independent of whether the base protocol is the single-writer, multiple-writer, or adaptive single-writer/multiple-writer protocol. Hence, we only present the results for dynamic aggregation using the base protocol with the best overall performance, the adaptive single-writer/multiple-writer protocol.
Similarly, the effects of adaptation between invalidate and update are the same for the single-writer, multiple-writer, and adaptive single-writer/multiple-writer protocols. Furthermore, since the update part of the adaptive invalidate/update protocol inherently includes aggregation, and since aggregation is always beneficial with invalidate protocols, we compare the invalidate-based adaptive single-writer/multiple-writer protocol with aggregation to the adaptive invalidate/update single-writer/multiple-writer protocol.
A Single vs MultipleWriter Protocol
Figure 1 shows the speedup on eight processors for each of the applications using the single-writer protocol, the multiple-writer protocol, and the protocol that adapts between the two, including the adaptation to migratory accesses. An invalidate protocol using the hardware page size is used, as in the base TreadMarks system.
We first compare the non-adaptive single- and multiple-writer protocols. As expected, the amount of write-write false sharing determines the tradeoff. The single-writer protocol performs better than the multiple-writer protocol on applications with no write-write false sharing and large overlapping diffs (IS), performs comparably on applications with low write-write false sharing (Water, 3D-FFT, Gauss), and worse for applications with high write-write false sharing (Barnes, MG, CG, and ILINK). Comparing the adaptive to the non-adaptive protocols, we see from Figure 1 that the adaptive protocol matches or exceeds the speedup of the best of the non-adaptive protocols.

[Fig. 2. Speedup comparison: protocols with and without dynamic aggregation. Series: Adapt SW/MW (no aggr), Adapt SW/MW (with aggr). Applications: Water, Barnes, FFT, IS, MG, CG, Gauss, Ilink. Vertical axis: speedup, 0.00 to 6.00.]
The adaptation that optimizes migratory access only affects IS. None of the other programs, such as Water, that have migratory data modify the entire page or suffer from significant diff accumulation. Consequently, they do not switch to single-writer mode, and thus the migratory optimization is not needed. For IS, it limits the ownership messages to one per page per iteration, instead of eight.
We do not present the memory demands for the protocols here, but we offer the following anecdote: for a larger 3D-FFT data set, the single-writer and adaptive protocols performed well, running completely in main memory, while the multiple-writer protocol paged because of the twins and diffs it stored, causing a  fold increase in execution time. (See Amza et al. for a detailed account.)
B Dynamic Aggregation of Pages
Figure 2 shows the speedups achieved with dynamic page aggregation, in addition to adapting between single- and multiple-writer and adapting to migratory access. As a baseline for comparison, we reiterate in Figure 2 the speedups from Figure 1 for the adaptive single-writer/multiple-writer protocol. Five out of the eight applications benefit from dynamic page aggregation: Barnes-Hut, 3D-FFT, IS, MG, and CG. The benefits for IS derive from the aggregation based on write accesses by other processors, while the benefits for the other four applications derive from the past access patterns by that processor itself. In IS, which sees the greatest benefits, processors exchange a large amount of data, leading to a significant reduction in the number of messages, with the attendant performance benefits. A similar argument explains the somewhat smaller improvements for Barnes-Hut, 3D-FFT, CG, and MG.
Surprisingly three of the applications that benet from
Adapt SW/MW (with aggr, inval)
Adapt SW/MW (with aggr, adaptive update/inval)
Water Barnes FFT IS MG CG Gauss Ilink
S
pe
ed
up
0.00
1.00
2.00
3.00
4.00
5.00
6.00
Fig 	 Speedup comparison invalidate and adaptive invali
date
update protocols
aggregation BarnesHut MG and CG suer from write
write false sharing This illustrates the fact that dynamic
page aggregation can reduce the number of messages with
out increasing false sharing
C Invalidate vs Update
Figure 3 shows the speedups for the protocol that adapts between invalidate and update (including aggregation of the updates), adapts between single- and multiple-writer, and adapts to migratory access. The results are shown alongside those for the invalidate protocol that performs dynamic page aggregation from Figure 2.
The benefits of automatic adaptation between invalidate and update are
questionable when such a protocol is compared to a base protocol that
performs aggregation. Its benefits are limited to avoiding page faults
and round-trip latencies, resulting in a small improvement. In many
cases, these improvements are offset by the additional data transfer.
Typically, the additional data transfer comes from changes in the
sharing pattern. For example, IS consists of a number of iterations,
each of which is divided into a number of migratory phases followed by
a phase in which the data produced by any single processor is consumed
by all other processors. This latter phase causes the adaptive
algorithm to send updates to all processors in the first migratory
phase of the next iteration. The negative acknowledgements halt these
updates after two migratory phases, but the large amount of
unnecessary data sent in these two phases causes performance to
deteriorate substantially. Adaptation between invalidate and update
is, however, attractive in some cases if the invalidate mode of the
base protocol does not support aggregation.
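
The way negative acknowledgements can stop useless updates, as in the
IS example above, can be sketched as follows. This is a minimal
illustration rather than the protocol evaluated in the paper: the
names PageCopy and Writer, and the rule that a destination sends a
NACK when an update arrives before it has read the previous one, are
assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PageCopy:
    """Receiver-side state for one page (hypothetical structure)."""
    has_unread_update: bool = False

    def receive_update(self) -> bool:
        """Apply an update; return False (a NACK) if the previous
        update was overwritten before it was ever read."""
        nack = self.has_unread_update
        self.has_unread_update = True
        return not nack

    def access(self) -> None:
        """The local processor touched the page, so the update was useful."""
        self.has_unread_update = False


@dataclass
class Writer:
    """Sender-side state: destinations still in update mode for the page."""
    update_set: set = field(default_factory=set)
    bytes_sent: int = 0

    def release(self, copies: dict, diff_bytes: int) -> None:
        """At a release, push the diff to every destination in update mode."""
        for dest in sorted(self.update_set):  # sorted() copies, so discard is safe
            self.bytes_sent += diff_bytes
            if not copies[dest].receive_update():
                # NACK received: fall back to invalidate for this destination.
                self.update_set.discard(dest)
```

In this sketch, a destination that never reads the page still receives
two updates before its NACK removes it from the update set, which is
the kind of wasted transfer the text describes.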
VII Related Work
A large number of software shared memory systems have been built.
Although the work described here is done in the context of a specific
system, TreadMarks, many of the ideas are applicable to other systems
as well. First, the adaptation between single- and multiple-writer
protocols carries over to all page-based systems. Second, aggregation
should prove to be beneficial to all systems, especially the ones that
use smaller consistency units. Finally, the tradeoff between update
and invalidate also applies to these other systems, although the
nature of the tradeoff may change substantially if compiler support is
used to determine the choice between update and invalidate.
The multiple-writer protocol described in this paper is the one in use
with the current version of TreadMarks. The single-writer protocol is
a variation of the one presented by Keleher. The adaptive
single-writer/multiple-writer protocol extends our earlier work on
this topic. In this earlier work, we chose a protocol that started out
in single-writer mode because of its reduced memory use (no twins are
ever made for pages that remain in single-writer mode). We found that
the same reduction in memory use can be achieved by a protocol that
starts out in multiple-writer mode, by not creating the initial twin,
which contains all zeroes. Starting in multiple-writer mode allows for
a straightforward adaptation according to the size of the diffs.
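
The observation about the all-zero initial twin can be illustrated
with a small sketch. It is not TreadMarks code: make_diff,
prefers_single_writer, the 4096-byte page size, and the 50% threshold
are assumptions chosen for illustration only.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumed page size in bytes


def make_diff(page: bytes, twin: Optional[bytes]) -> Dict[int, int]:
    """Return {offset: new_byte} for every byte that differs from the twin.

    twin=None stands for the initial twin that is never created; it is
    treated as a page of zeroes, so no memory is spent on it.
    """
    if twin is None:
        return {i: b for i, b in enumerate(page) if b != 0}
    return {i: b for i, (b, t) in enumerate(zip(page, twin)) if b != t}


def prefers_single_writer(diff: Dict[int, int], threshold: float = 0.5) -> bool:
    """Hypothetical rule: a diff covering most of the page suggests that
    transferring the whole page (single-writer mode) is cheaper."""
    return len(diff) > threshold * PAGE_SIZE
```

A diff computed against the implicit zero twin is identical to one
computed against an explicit zero-filled page, which is why skipping
the initial twin saves memory without changing the result.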
The adaptive DSM system described by Monnerat and Bianchini is most
closely related to our work. They also investigate the adaptation
between single- and multiple-writer protocols and adaptation between
invalidate and update. In their system, pages are classified as
migratory, producer-consumer, or falsely shared. Single-writer mode is
used for migratory and producer-consumer pages, while the falsely
shared pages are maintained in multiple-writer mode. Updates are used
only for migratory and producer-consumer pages. Keleher et al. and
Speight et al. have also investigated the benefits of allowing a
software shared memory system the choice between invalidate and
update. However, to the best of our knowledge, all of these studies
were conducted in the absence of aggregation for the invalidate
protocol, inflating the perceived benefits of update. We have
demonstrated that communication aggregation is the key to improving
performance in both invalidate and update protocols. Adding dynamic
aggregation to the invalidate protocol provides the same benefits as
using an update protocol, without the risk of sending extra messages.
Amza et al. investigated the benefits of dynamic page aggregation.
They did not, however, combine aggregation with other forms of
adaptation. Lu et al. found that aggregation is the main reason that
message-passing programs outperform (software) shared-memory programs.
Overall, they found that for six out of their eight applications, the
speedup on TreadMarks was within  of that achieved by PVM. With the
best static page aggregation for each of those six applications, the
speedup on TreadMarks improved to within  of the speedup on PVM. These
results were obtained on two platforms, one of which, the Mbps ATM
network of eight SPARCstation  Model  workstations, is similar to the
platform used in this paper.
Our adaptive single-writer/multiple-writer protocol addresses the most
extreme cases of a less common problem, diff accumulation, found by
Lu et al. Diff accumulation in IS contributed to the worst performance
with respect to PVM: TreadMarks' speedup was only  of PVM's. With diff
accumulation manually removed, the speedup improved to within  of the
speedup on PVM. Our adaptive protocol automatically achieves a similar
improvement.
Several other systems, both hardware and software, have investigated
configurability or adaptivity as a means of improving performance.
Shasta features configurable consistency units to address the
requirements of applications with fine-grain sharing, at the expense
of higher memory overheads.
Munin uses multiple protocols to handle data with different access
characteristics. The novelty in our work is that it chooses
automatically between different protocols. In Munin, the choice of
protocol was based on somewhat burdensome user annotations.
Cashmere improves on the home-based protocol introduced by Zhou et
al., allowing dynamic migration of the home node. The home-based
protocol allows a single-writer optimization that avoids diffing
overhead when the home node is the only writer for the page. The
downside is that whole pages are fetched on faults even if the amount
of data modified is small.
Dubnicki and LeBlanc proposed a scheme to reduce the impact on
performance due to a mismatch between the cache block size and the
sharing patterns exhibited by a given application. They adjusted the
amount of data stored in a cache block according to recent reference
patterns. They found that the adjustable cache-block-size
implementation did better than the best fixed-size implementations for
most of the programs in their suite.
The adaptation to migratory behavior was first suggested by Cox and
Fowler and by Stenstrom et al. in the context of hardware shared
memory machines.
Another form of adaptivity that is important in networks of
workstations is adapting to environmental characteristics, such as
processor and network load. This form of adaptivity is orthogonal to
the one discussed in this paper.
VIII Conclusions
We have described software DSM protocols that automatically adapt on a
per-page basis to the access patterns in the application. The
protocols dynamically choose between single- and multiple-writer
protocols. Pages can be dynamically aggregated into larger page
groups. Finally, the protocols choose dynamically between invalidate
and update. All adaptation is automatic.
The choice between the single- and multiple-writer protocols is based
on the presence of write-write false sharing and on write granularity.
In addition, the protocol detects migratory behavior and chooses a
version of the protocol optimized accordingly. Aggregation uses
records of earlier accesses by a processor to coalesce pages into page
groups, in the expectation that those pages will be accessed again by
the processor. The choice between invalidate and update is based on
whether we expect the destination to access the modified data before
it is overwritten or not. The three adaptations can easily be
combined.
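
The aggregation idea summarized above, grouping pages according to a
processor's own earlier accesses, might be sketched as follows. The
Aggregator class, its method names, and the fixed group size are
hypothetical; the paper's actual grouping policy is not reproduced
here.

```python
class Aggregator:
    """Per-processor record of page faults, used to form page groups."""

    def __init__(self, max_group=4):
        self.max_group = max_group  # assumed upper bound on group size
        self.group_of = {}          # page number -> tuple of grouped pages
        self._faults = []           # pages faulted on in the current interval

    def record_fault(self, page):
        """Remember that this processor faulted on `page`."""
        if page not in self._faults:
            self._faults.append(page)

    def end_interval(self):
        """At a synchronization point, coalesce the pages accessed in
        the interval into groups of at most `max_group` pages."""
        for i in range(0, len(self._faults), self.max_group):
            group = tuple(self._faults[i:i + self.max_group])
            for page in group:
                self.group_of[page] = group
        self._faults = []

    def pages_to_fetch(self, page):
        """On a later fault, request the whole group in one exchange."""
        return self.group_of.get(page, (page,))
```

A later fault on any page of a group then triggers a single request
for the whole group, trading a possibly larger transfer for fewer
messages.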
Adaptation between single- and multiple-writer and dynamic aggregation
proved to be the most beneficial, never causing any deterioration and
providing substantial improvement for some applications. Our automatic
adaptation between invalidate and update was less successful, showing
at best gains equal to the dynamic aggregation adaptation and at worst
serious performance deterioration. We speculate that it may be
difficult to find a fully automatic, purely runtime algorithm for
adaptation between invalidate and update, and that either compiler or
user input may be necessary to achieve good performance.
Acknowledgements
This work was supported in part by NSF grants CCR
 CCR CCR CCR
CCR CCR CDA and MIP
 by the Texas TATP program under Grant 
 and by grants from IBM Corporation and from Tech-Sym, Inc.
References
 C Amza AL Cox S Dwarkadas P Keleher H Lu R Raja
mony W Yu and W Zwaenepoel TreadMarks Shared mem
ory computing on networks of workstations IEEE Computer
 February 
 C Amza AL Cox S Dwarkadas and W Zwaenepoel Software
DSM protocols that adapt between single writer and multiple
writer In Proceedings of the Third International Symposium
on HighPerformance Computer Architecture pages 
February 
	 C Amza AL Cox K Rajamani and W Zwaenepoel Trade
os between false sharing and aggregation in software distributed
shared memory In Proceedings of the th Symposium on the
Principles and Practice of Parallel Programming pages 
June 
 D Bailey J Barton T Lasinski and H Simon The NAS
parallel benchmarks Technical Report TR RNR NASA
Ames August 
 HE Bal MF Kaashoek and AS Tanenbaum Orca A lan
guage for parallel programming of distributed systems IEEE
Transactions on Software Engineering pages  June

 RD Blumofe and PA Lisiecki Adaptive and reliable parallel
computing on network of workstations In Proceedings of the
USENIX  Annual Technical Symposium January 
 N Carriero E Freeman D Gelernter and D Kaminsky Adap
tive parallelism and piranha IEEE Computer  January

 JB Carter JK Bennett and W Zwaenepoel Techniques for
reducing consistencyrelated information in distributed shared
memory systems ACM Transactions on Computer Systems
			 August 
 A L Cox and RJ Fowler Adaptive cache coherency for detect
ing migratory shared data In Proceedings of the th Annual
International Symposium on Computer Architecture pages 
 May 	
 C Dubnicki and T LeBlanc Adjustable block size coherent
caches In Proceedings of the th Annual International Sympo
sium on Computer Architecture pages  May 
 S Dwarkadas AL Cox and W Zwaenepoel An integrated
compiletime
runtime software distributed shared memory sys
tem In Proceedings of the th Symposium on Architectural Sup
port for Programming Languages and Operating Systems pages
 October 
 SJ Eggers and RH Katz A characterization of sharing in par
allel programs and its application to coherency protocol evalu
ation In Proceedings of the 	th Annual International Sympo
sium on Computer Architecture pages 				 May 
	 K Gharachorloo D Lenoski J Laudon P Gibbons A Gupta
and J Hennessy Memory consistency and event ordering in
scalable sharedmemory multiprocessors In Proceedings of the
th Annual International Symposium on Computer Architec
ture pages  May 
 P Keleher The relative importance of concurrent writers and
weak consistency models In Proceedings of the th Interna
tional Conference on Distributed Computing Systems pages 
 May 
 P Keleher A L Cox S Dwarkadas and W Zwaenepoel An
evaluation of softwarebased release consistent protocols Jour
nal of Parallel and Distributed Computing  October

 P Keleher A L Cox and W Zwaenepoel Lazy release consis
tency for software distributed shared memory In Proceedings of
the th Annual International Symposium on Computer Archi
tecture pages 	 May 
 L Lamport How to make a multiprocessor computer that cor
rectly executes multiprocess programs IEEE Transactions on
Computers C September 
 H Lu S Dwarkadas A L Cox and W Zwaenepoel Quantify
ing the performance dierences between PVM and TreadMarks
Journal of Parallel and Distributed Computing 	
June 
 LR Monnerat and R Bianchini Eciently adapting to shar
ing patterns in software DSMs In Proceedings of the Fourth
International Symposium on HighPerformance Computer Ar
chitecture February 
 DJ Scales K Gharachorloo and CA Thekkath Shasta A
low overhead softwareonly approach for supporting negrain
shared memory In Proceedings of the th Symposium on Ar
chitectural Support for Programming Languages and Operating
Systems October 
 AA Schaer Faster linkage analysis computations for pedigrees
with loops or unused alleles Human Heredity 	 jul

 JP Singh WD Weber and A Gupta SPLASH Stanford
parallel applications for sharedmemory Technical Report CSL
TR Stanford University April 
	 WE Speight and JK Bennett Using multicast and multi
threading to reduce communication in software DSM systems
In Proceedings of the Fourth International Symposium on High
Performance Computer Architecture February 
 P Stenstrom M Brorsson and L Sandberg An adaptive cache
coherence protocol optimized for migratory sharing In Proceed
ings of the th Annual International Symposium on Computer
Architecture May 	
 R Stets S Dwarkadas N Hardavellas G Hunt L Kon
tothanassis S Parthasarathy and M Scott CashmereL Soft
ware coherent shared memory on a clustered remote write net
work In Proceedings of the th ACM Symposium on Operating
Systems Principles October 
 WD Weber and A Gupta Analysis of cache invalidation pat
terns in multiprocessors In Proceedings of the 
rd Symposium
on Architectural Support for Programming Languages and Op
erating Systems pages 	 April 
 Y Zhou L Iftode and K Li Performance evaluation of two
homebased lazy release consistency protocols for shared virtual
memory systems In Proceedings of the Second USENIX Sym
posium on Operating System Design and Implementation pages
 nov 
