Lazy Release Consistency for Software Distributed Shared Memory by Cox, A.L. et al.
Lazy Release Consistency
for Software Distributed Shared Memory
Pete Keleher, Alan L. Cox, and Willy Zwaenepoel
Department of Computer Science
Rice University
March   
Abstract
Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important to reduce the number of messages and the amount of data exchanged for remote memory access. Lazy release consistency is a new algorithm for implementing release consistency that lazily pulls modifications across the interconnect only when necessary. Trace-driven simulation using the SPLASH benchmarks indicates that lazy release consistency reduces both the number of messages and the amount of data transferred between processors. These reductions are especially significant for programs that exhibit false sharing and make extensive use of locks.
1 Introduction
Over the past few years, researchers in hardware distributed shared memory (DSM) have proposed relaxed memory consistency models to reduce the latency associated with remote memory accesses. For instance, in release consistency (RC) [8], writes to shared memory by processor $p_1$ need to be performed (become visible) at another processor $p_2$ only when a subsequent release of $p_1$ performs at $p_2$. This relaxation of the memory consistency model allows the DASH implementation of RC [11] to combat memory latency by pipelining writes to shared memory (see Figure 1). The processor is stalled only when executing a release, at which time it must wait for all its previous writes to perform.
 
This work is supported in part by NSF Grant No. CDA and by a Texas ATP grant. Pete Keleher was supported by a NASA Fellowship.
[Figure 1: Pipelining Remote Memory Accesses in DASH. p1 performs w(x), w(y), w(z), rel; x, y, and z travel to p2 as separate messages.]
In software DSMs, it is also important to reduce the number of messages exchanged. Sending a message in a software DSM is more expensive than in a hardware DSM, because it may involve traps into the operating system kernel, interrupts, context switches, and the execution of several layers of networking software. Ideally, the number of messages exchanged in a software DSM should equal the number of messages exchanged in a message-passing implementation of the same application. Therefore, Munin's write-shared protocol [5], a software implementation of RC, buffers writes until a release, instead of pipelining them as in the DASH implementation. At the release, all writes going to the same destination are merged into a single message (see Figure 2).
Even Munins writeshared protocol may send more
messages than a message passing implementation of the
w(x) w(y) w(z) rel
x,y,z
p1
p2
Figure  Merging of Remote Memory
Updates in Munin

same application Consider the example of Figure  
where processors p
 
through p

repeatedly acquire the
lock l  write the shared variable x  and then release l
If an update policy is used in conjunction with Munins
writeshared protocol  and x is present in all caches 
then all of these cached copies are updated at every
release Logically  however  it suces to update each
processors copy only when it acquires l This results in
a single message exchange per acquire  as in a message
passing implementation This problem is not peculiar
to the use of an update policy Similar examples can
be constructed for an invalidate policy
Lazy release consistency LRC is a new algorithm
for implementing RC  aimed at reducing both the num
ber of messages and the amount of data exchanged
Unlike eager algorithms such as Munins writeshared
protocol  lazy algorithms such as LRC do not make
modications globally visible at the time of a release
Instead  LRC guarantees only that a processor that ac
quires a lock will see all modications that precede
the lock acquire The term preceding in this context
is to be interpreted in the transitive sense informally 
a modication precedes an acquire  if it occurs before
any release such that there is a chain of releaseacquire
operations on the same lock  ending with the current
acquire see Section  for a precise denition For in
stance  in Figure   all modications that occur in pro
gram order before any of the releases in p
 
through p

precede the lock acquisition in p

 With LRC  modi
cations are propagated at the time of an acquire Only
the modications that precede the acquire are sent to
the acquiring processor The modications can be pig
gybacked on the message that grants the lock  further
reducing message trac Figure  shows the message
trac in LRC for the same shared data accesses as in
Figure  l and x are sent in a single message at each
acquire
[Figure 3: Repeated Updates of Cached Copies in Eager RC. Timelines for p1 through p4 showing acquires, accesses to x, and releases.]
By not propagating modifications globally at the time of the release, and by piggybacking data movement on lock transfer messages, LRC reduces both the number of messages and the amount of data exchanged. We present the results of a simulation study, using the SPLASH benchmarks, that confirms this intuition. LRC is, however, more complex to implement than eager RC because it must keep track of the "precedes" relation. We intend to implement LRC to evaluate its runtime cost. The message and data reductions seen in our simulations seem to indicate that LRC will outperform eager RC in a software DSM environment.
The outline of the rest of this paper is as follows. In Section 2, we state the definition of RC. In Section 3, we present an eager implementation of RC based on Munin's write-shared protocol. In Section 4, we define LRC and outline its implementation. In Section 5, we describe a comparison through simulation of eager RC and LRC. We briefly discuss related work in Section 6, and we draw conclusions and explore avenues for further work in Section 7.
2 Release Consistency
Release consistency (RC) [8] is a form of relaxed memory consistency that allows the effects of shared memory accesses to be delayed until certain specially labeled accesses occur. RC requires shared memory accesses to be labeled as either ordinary or special. Within the special category, accesses are divided into those labeled sync and nsync, and sync accesses are further subdivided into acquires and releases.
Definition 1. A system is release consistent if:
1. Before an ordinary access is allowed to perform with respect to any other processor, all previous acquires must be performed.
2. Before a release is allowed to perform with respect to any other processor, all previous ordinary reads and writes must be performed.
3. Special accesses are sequentially consistent with respect to one another.
[Figure 4: Message Traffic in LRC. Timelines for p1 through p4 with the same accesses as in Figure 3.]

A write is performed with respect to another processor when reads by that processor return the new write's (or a subsequent write's) value. Reads are performed with respect to another processor when a write issued by that processor can no longer affect the value returned by the read. Accesses are performed when they are performed with respect to all processors in the system.
Properly labeled programs [8] produce the same results on RC memory as they would on sequentially consistent memory [10]. Informally, a program is properly labeled if there are enough accesses labeled as acquires or releases, such that, for all legal interleavings of accesses, each pair of conflicting ordinary accesses is separated by a release-acquire chain. Two accesses conflict if they reference the same memory location, and at least one of them is a write.
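
The conflict condition in the preceding paragraph can be stated directly as a predicate. The following is a minimal Python sketch; the `Access` record and its field names are assumptions made for illustration and do not appear in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One shared-memory access (hypothetical record for illustration)."""
    processor: int
    location: str
    is_write: bool


def conflicts(a: Access, b: Access) -> bool:
    """Two accesses conflict if they reference the same memory location
    and at least one of them is a write."""
    return a.location == b.location and (a.is_write or b.is_write)


# A read and a write of x conflict; accesses to different locations do not,
# and neither do two reads of the same location.
r_x = Access(processor=1, location="x", is_write=False)
w_x = Access(processor=2, location="x", is_write=True)
w_y = Access(processor=2, location="y", is_write=True)
r_x_other = Access(processor=3, location="x", is_write=False)
```

A properly labeled program is one in which every pair of ordinary accesses for which `conflicts` holds is separated by a release-acquire chain.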
RC implementations can delay the effects of shared memory accesses as long as they meet the constraints of Definition 1.
3 Eager Release Consistency
We base our eager RC algorithm on Munin's write-shared protocol [5]. A processor delays propagating its modifications to shared data until it comes to a release. At that time, it propagates the modifications to all other processors that cache the modified pages. For an invalidate protocol, this simply entails sending invalidations for all modified pages to the other processors that cache these pages. In order to limit the amount of data exchanged, an update protocol sends a diff of each modified page to other cachers. A diff describes the modifications made to the page, which are then merged in the other cached copies. In either case, the release blocks until acknowledgments have been received from all other cachers.
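
To make the notion of a diff concrete, the following Python sketch computes a diff by comparing a modified page against a saved unmodified copy and merges diffs into another cached copy. The text above does not say how a diff is computed or encoded; the comparison against a saved copy, the run-length encoding as `(offset, bytes)` pairs, and the names `make_diff` and `apply_diff` are all assumptions made for illustration.

```python
def make_diff(unmodified: bytes, page: bytes) -> list:
    """Return a list of (offset, data) runs where `page` differs from
    `unmodified` (the saved copy taken before the first write)."""
    assert len(unmodified) == len(page)
    diff = []
    i, n = 0, len(page)
    while i < n:
        if page[i] != unmodified[i]:
            start = i
            while i < n and page[i] != unmodified[i]:
                i += 1
            diff.append((start, page[start:i]))
        else:
            i += 1
    return diff


def apply_diff(page: bytearray, diff: list) -> None:
    """Merge the modifications described by `diff` into a cached copy."""
    for offset, data in diff:
        page[offset:offset + len(data)] = data


# Two processors modify different parts of the same 16-byte page.
original = bytes(16)
p1_copy = bytearray(original)
p1_copy[0:4] = b"AAAA"
p2_copy = bytearray(original)
p2_copy[8:12] = b"BBBB"

d1 = make_diff(original, bytes(p1_copy))
d2 = make_diff(original, bytes(p2_copy))

# A third cacher merges both diffs into its own copy.
merged = bytearray(original)
apply_diff(merged, d1)
apply_diff(merged, d2)
```

Only the modified bytes travel in each diff, which is the point of sending diffs rather than whole pages. A byte-wise comparison like this one cannot detect a write that stores the value already present; that limitation belongs to the sketch.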
No consistencyrelated operations occur on an ac
quire The protocol locates the processor that last exe
cuted a release on the same variable  and the resulting
value is sent from the last releaser to the current ac
quirer
On an access miss  a message is sent to the directory
manager for the page The directory manager forwards
the request to the current owner  and the current owner
sends the page to the processor that incurred the access
miss
4 Lazy Release Consistency
In LRC, the propagation of modifications is further postponed until the time of the acquire. At this time, the acquiring processor determines which modifications it needs to see according to the definition of RC. To do so, LRC uses a representation of the happened-before-1 partial order introduced by Adve and Hill. The happened-before-1 partial order is a formalization of the "preceding" relation mentioned in Section 1.
4.1 The happened-before-1 Partial Order
We summarize here the relevant aspects of the definitions of happened-before-1.
Denition   Shared memory accesses are partially
ordered by happenedbefore denoted
hb 
  de	ned as
follows
 If a
 
and a

are accesses on the same processor
and a
 
occurs before a

in program order then
a
 
hb 
  a


 If a
 
is a release on processor p
 
 and a

is an
acquire on the same memory location on processor
p

 and a

returns the value written by a
 
 then
a
 
hb 
  a


 If a
 
hb 
  a

and a

hb 
  a

 then a
 
hb 
  a


RC requires that before a processor may continue past an acquire, all shared accesses that precede the acquire according to $\xrightarrow{hb1}$ must be performed at the acquiring processor. LRC guarantees that this property holds by propagating write-notices on the message that effects a release-acquire pair. A write-notice is an indication that a page has been modified in a particular interval, but it does not contain the actual modifications. Write-notices and actual values of modifications may be sent at different times in different messages.
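
The three rules of the happened-before-1 definition above can be sketched as a small transitive-closure computation. This Python sketch is illustrative only: the representation of accesses as string identifiers, the function name `happened_before_1`, and its argument layout are assumptions, and an actual LRC implementation would not materialize the full relation.

```python
from collections import defaultdict


def happened_before_1(program_order, release_acquire_pairs):
    """Return hb1 as a set of ordered pairs (earlier, later).

    program_order: dict mapping a processor to its accesses in program order.
    release_acquire_pairs: iterable of (release, acquire) pairs in which the
        acquire returns the value written by the release.
    """
    succ = defaultdict(set)
    # Rule 1: program order (consecutive edges; rule 3 supplies the rest).
    for accesses in program_order.values():
        for earlier, later in zip(accesses, accesses[1:]):
            succ[earlier].add(later)
    # Rule 2: a release precedes the acquire that observes it.
    for release, acquire in release_acquire_pairs:
        succ[release].add(acquire)
    # Rule 3: transitivity, via a depth-first search from every access.
    nodes = [a for accesses in program_order.values() for a in accesses]
    hb1 = set()
    for start in nodes:
        stack = list(succ[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            hb1.add((start, node))
            stack.extend(succ[node])
    return hb1


# A lock handed from p1 to p2 to p3: the write on p1 precedes the read on p3.
po = {
    "p1": ["p1.acq", "p1.w(x)", "p1.rel"],
    "p2": ["p2.acq", "p2.w(x)", "p2.rel"],
    "p3": ["p3.acq", "p3.r(x)"],
}
pairs = [("p1.rel", "p2.acq"), ("p2.rel", "p3.acq")]
hb = happened_before_1(po, pairs)
```

In this example the pair `("p1.w(x)", "p3.r(x)")` is in the relation through the chain of release-acquire pairs, which is the transitive sense of "precede" used informally in the introduction.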
  WriteNotice Propagation
We divide the execution of each processor into distinct
intervals  a new interval beginning with each special
access executed by the processor We dene a happens 
before  partial order between intervals in the obvious
way an interval i
 
precedes an interval i

according to
hb 
    if all accesses in i
 
precede all accesses in i

ac
cording to
hb 
   An interval is said to be performed at a
processor if all modications made during that interval
have been performed at that processor
Let $V_p(i)$ be the vector timestamp [14] for interval i of processor p. The number of elements in the vector $V_p(i)$ is equal to the number of processors. The entry for processor p in $V_p(i)$ is equal to i. The entry for processor $q \neq p$ in $V_p(i)$ denotes the most recent interval of processor q that has performed at p.
On an acquire, the acquiring processor, p, sends its current vector timestamp $V_p$ to the previous releaser, q. Processor q uses this information to send p the write-notices for all intervals of all processors that have performed at q but have not yet performed at p. Releases are purely local operations in LRC; no messages are exchanged.
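
The exchange on an acquire can be sketched as follows. This Python sketch rests on several assumptions that are not in the text: the `Processor` class and its method names are hypothetical, intervals are numbered from 1, an interval counts as performed at a processor once its write-notice has arrived there, and message passing is replaced by direct method calls.

```python
class Processor:
    def __init__(self, pid, nprocs):
        self.pid = pid
        # vt[q] is the most recent interval of q seen here; vt[pid] is the
        # current (still open) interval of this processor.
        self.vt = [0] * nprocs
        self.vt[pid] = 1
        self.notices = {}   # (processor, interval) -> set of modified pages
        self.dirty = set()  # pages written during the current interval

    def write(self, page):
        self.dirty.add(page)

    def _close_interval(self):
        # Every special access ends the current interval and starts a new one.
        if self.dirty:
            self.notices[(self.pid, self.vt[self.pid])] = set(self.dirty)
            self.dirty.clear()
        self.vt[self.pid] += 1

    def release(self):
        # Purely local: no messages are exchanged.
        self._close_interval()

    def notices_missing_at(self, acquirer_vt):
        # Write-notices known here that the acquirer has not yet seen.
        return {
            (proc, interval): set(pages)
            for (proc, interval), pages in self.notices.items()
            if acquirer_vt[proc] < interval
        }

    def acquire(self, last_releaser):
        self._close_interval()
        incoming = last_releaser.notices_missing_at(self.vt)
        self.notices.update(incoming)
        for proc in range(len(self.vt)):
            if proc == self.pid:
                continue
            seen = last_releaser.vt[proc]
            if proc == last_releaser.pid:
                seen -= 1  # the releaser's current interval is still open
            self.vt[proc] = max(self.vt[proc], seen)
        return incoming


p0, p1, p2 = (Processor(i, 3) for i in range(3))
p0.write("A")
p0.release()
got1 = p1.acquire(p0)   # p1 learns that p0 modified page A
p1.write("B")
p1.release()
got2 = p2.acquire(p1)   # p2 learns about both A and B from p1
got3 = p2.acquire(p1)   # nothing new the second time
```

Note that `p2` receives the write-notice for `p0`'s interval from `p1`, without ever communicating with `p0`. Only write-notices move here; the modifications themselves are fetched separately.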
4.3 Data Movement Protocols
4.3.1 Multiple-Writer Protocols
Both Munin and LRC allow multiple-writer protocols. Multiple processors can write to different parts of the same page concurrently, without intervening synchronization. This is in contrast to the exclusive-writer protocol used, for instance, in DASH [8], where a processor must obtain exclusive access to a cache line before it can be modified. Experience with Munin [5] indicates that multiple-writer protocols perform well in software DSMs, because they can handle false sharing without generating large amounts of message traffic. Given the large page sizes in software DSMs, false sharing is an important problem. Exclusive-writer protocols may cause falsely shared pages to ping-pong back and forth between different processors. Multiple-writer protocols allow each processor to write into a falsely shared page without any message traffic. The modifications of the different processors are later merged using the diffs described in Section 3.
4.3.2 Invalidate vs. Update
In the case of an invalidate protocol, the acquiring processor invalidates all pages in its cache for which it received write-notices. In the case of an update protocol, the acquiring processor updates those pages. Let i be the current interval. For each page in the cache, diffs must be obtained from all concurrent last modifiers. These are all intervals j such that $j \xrightarrow{hb1} i$, the page was modified in interval j, and there is no interval k, such that $j \xrightarrow{hb1} k \xrightarrow{hb1} i$, in which the modification from interval j was overwritten.
4.3.3 Access Misses
On an access miss, a copy of the page may have to be retrieved, as well as a number of diffs. The modifications summarized by the diffs are then merged into the page before it is accessed.
On an access miss during interval i, diffs must be obtained for all intervals j such that $j \xrightarrow{hb1} i$, the missing page was modified in interval j, and there is no interval k, such that $j \xrightarrow{hb1} k \xrightarrow{hb1} i$, in which the modification from interval j was overwritten.
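
The selection rule above can be sketched with vector timestamps. In this Python sketch, the `Interval` record, the `offsets` field (the set of byte offsets of the page modified in that interval), and the test "k overwrote j if k's offsets cover j's" are assumptions made for illustration; the text does not specify how overwriting is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    proc: int
    index: int
    vt: tuple            # vector timestamp; vt[proc] == index
    offsets: frozenset   # offsets of the page modified here (hypothetical)


def precedes(j: Interval, i: Interval) -> bool:
    """Decide whether j precedes i in hb1, using vector timestamps."""
    if j.proc == i.proc:
        return j.index < i.index
    return i.vt[j.proc] >= j.index


def diffs_needed(current: Interval, intervals) -> list:
    """Intervals whose diffs must be obtained for the page at `current`."""
    before = [j for j in intervals if j.offsets and precedes(j, current)]
    needed = []
    for j in before:
        # Every k in `before` already precedes `current`.
        overwritten = any(
            precedes(j, k) and j.offsets <= k.offsets
            for k in before
            if k is not j
        )
        if not overwritten:
            needed.append(j)
    return needed


# j1 is overwritten by k (which saw j1); j2 is concurrent with both.
j1 = Interval(proc=0, index=1, vt=(1, 0, 0, 0), offsets=frozenset({0, 1}))
k = Interval(proc=1, index=1, vt=(1, 1, 0, 0), offsets=frozenset({0, 1, 2}))
j2 = Interval(proc=2, index=1, vt=(0, 0, 1, 0), offsets=frozenset({8}))
current = Interval(proc=3, index=1, vt=(1, 1, 1, 1), offsets=frozenset())

needed = diffs_needed(current, [j1, k, j2])
```

The sketch is conservative: if the modification from j is overwritten only by the combination of several later intervals, j's diff is still requested, which costs data but not correctness provided diffs are applied in happened-before-1 order.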
If the processor still holds an invalidated copy of the page, LRC does not send the entire page over the interconnect. The write-notices contain all the information necessary to determine which diffs need to be applied to this copy of the page in order to bring it up-to-date. The happened-before-1 partial order specifies the order in which the diffs need to be applied. This optimization reduces the amount of data sent.
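
One way to apply diffs in an order consistent with the partial order is to sort them by the sum of their vector-timestamp entries. The sketch below assumes standard vector-clock behavior: if one interval precedes another, its timestamp is no larger in any entry and strictly smaller in at least one, so its sum is strictly smaller. The function name and the `(offset, bytes)` diff encoding are hypothetical, and the text does not say that LRC orders diffs this way.

```python
def apply_in_hb1_order(page: bytearray, tagged_diffs) -> None:
    """Apply diffs so that earlier intervals are applied before later ones.

    tagged_diffs: iterable of (vector_timestamp, diff), where diff is a list
    of (offset, data) runs.
    """
    for _, diff in sorted(tagged_diffs, key=lambda tagged: sum(tagged[0])):
        for offset, data in diff:
            page[offset:offset + len(data)] = data


# The later interval (timestamp (1, 1)) saw the earlier one (timestamp (1, 0))
# and rewrote part of the same bytes. The diffs arrive out of order.
stale_copy = bytearray(8)
arrived = [
    ((1, 1), [(0, b"new")]),
    ((1, 0), [(0, b"old!")]),
]
apply_in_hb1_order(stale_copy, arrived)
```

Diffs from concurrent intervals may be applied in either order relative to each other; in a properly labeled program they do not contain conflicting writes.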
5 Simulation
We present the results of a simulation study based on multiprocessor traces of five shared-memory application programs from the SPLASH suite [16]. We measured the number of messages and the amount of data exchanged by each program for an execution using each of four protocols: lazy update (LU), lazy invalidate (LI), eager update (EU), and eager invalidate (EI). We then relate the communication behavior to the shared memory access patterns of the application programs.
5.1 Methodology
A trace was generated from a processor execution of each program using the Tango multiprocessor simulator [6]. These traces were then fed into our protocol simulator. We simulated page sizes from 512 to 8192 bytes.
We assume infinite caches and reliable FIFO communication channels. We do not assume any broadcast or multicast capability of the network.
5.2 Message Counts
The SPLASH programs use barriers and exclusive locks for synchronization. Communication occurs on barrier arrival and departure, on lock and unlock, and on an access miss. Table 1 shows the message count for each of these events under each of the protocols.
A miss costs either two or three messages for the eager protocols, depending on whether or not the directory manager has a valid copy of the page (see Section 3). For the lazy protocols, a miss requires collecting diffs from the concurrent last modifiers of the page (see Section 4.3).
For a lock operation, three messages are used by all four protocols for finding and transferring the lock. In addition, in LU, the new lock holder collects all the diffs necessary to bring its cached pages up-to-date, causing h additional messages. No extra messages are required at this time for LI, because the invalidations are piggybacked on the lock transfer message. Also, no additional messages are required for EU and EI.
On unlocks, the eager protocols send write-notices to all cachers of locally modified pages, using c messages. The lazy protocols do not communicate on unlocks.

Access Miss Locks Unlocks Barriers
LI m   n
LU m h  nu
EI  or   c n  v
EU   c n  u
m = number of concurrent last modifiers for the missing page
h = number of other concurrent last modifiers for any local page
c = number of other cachers of the page
n = number of processors in the system
p = number of pages in the system
u = $\sum_{i=1}^{n}$ (number of other cachers of pages modified by i)
v = $\sum_{i=1}^{p}$ (number of excess invalidators of page i)
Table 1: Shared Memory Operation Message Costs
Barriers are implemented by sending an arrival message to the barrier master and waiting for the return of an exit message. Consequently, 2(n - 1) messages are used to implement a barrier. In addition, both update protocols require u messages to send updates to all processors caching modified pages. The LI protocol requires no additional messages, because invalidations are piggybacked on the messages used for implementing the barrier. The EI protocol may require a small number of additional messages (v) to resolve multiple invalidations of a single page.
5.3 SPLASH Program Suite
5.3.1 LocusRoute
LocusRoute is a VLSI cell router. The major data structure is a cost grid for the cell, a cell's cost being the number of wires already running through it. Work is allocated to processors a wire at a time. Synchronization is accomplished almost entirely through locks that protect access to a central task queue.
Data movement in LocusRoute is largely migratory; locks dominate the synchronization, and data moves according to lock accesses. As page size increases, false sharing also becomes important. Both of these factors favor lazy protocols.
Figures 5 and 6 show LocusRoute's performance. The lazy protocols reduce the number of messages and the amount of data exchanged, for all page sizes.
[Figure 5: LocusRoute Messages. Messages versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 6: LocusRoute Data. Data (kbytes) versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
5.3.2 Cholesky Factorization
Cholesky performs the symbolic and numeric portions of a Cholesky factorization of a sparse positive definite matrix. Locks are used to control access to a global task queue and to arbitrate access when simultaneous supernodal modifications attempt to modify the same column. No barriers are used.
Data motion in Cholesky is largely migratory, as in LocusRoute. The resulting performance of Cholesky is therefore also similar to that of LocusRoute. Figures 7 and 8 show that the lazy protocols reduce the number of messages and the amount of data exchanged, for all page sizes.
 MPD
MP D simulates rareed hypersonic airow over an ob
ject using a Monte Carlo algorithm Each timestep in
volves several barriers  with locks used to control access
to global event counters
The message trac for MP D is dominated by access
misses Figures 
 and  show MP Ds performance The
lazy protocols exchange less data than the eager ones 
because they only need to send dis on an access miss
and not full pages  as do the eager protocols The up
date protocols exchange fewer messages  because they
incur fewer access misses
[Figure 7: Cholesky Messages. Messages versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 8: Cholesky Data. Data (kbytes) versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 9: MP3D Messages. Messages versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 10: MP3D Data. Data (kbytes) versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
5.3.4 Water
Water performs an N-body molecular dynamics simulation, evaluating forces and potentials in a system of water molecules in the liquid state. At each timestep, every molecule's velocity and potential is computed from the influences of other molecules within a spherical cutoff range. Several barriers are used to synchronize each timestep, while locks are used to control access to a global running sum and to each molecule's force sum.
Of the five benchmark programs, Water has the least communication. Figures 11 and 12 show the message and data traffic for Water. While the lazy protocols use only slightly fewer messages than eager protocols for large page sizes, their data totals are significantly lower because they can often avoid bringing an entire page across the network on an access miss.
[Figure 11: Water Messages. Messages versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 12: Water Data. Data (kbytes) versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
5.3.5 Pthor
Pthor is a parallel logic simulator. The major data structures represent logic elements, wires between elements, and per-processor work queues. Locks are used to protect access to all three types of data structures. Barriers are used only when deadlock occurs and all task queues are empty.
In Pthor, each processor has a set of pages that it modifies. However, these pages are also frequently read by the other processors. Under an invalidation protocol, this causes a large number of invalidations and later reloads.
Figures 13 and 14 show Pthor's performance. Data totals for EI are particularly high, because frequent reloads cause the entire page to be sent. The message count for LI is higher than for LU, because LI has more access misses.
[Figure 13: Pthor Messages. Messages versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
[Figure 14: Pthor Data. Data (kbytes) versus page size (8192 to 512 bytes) for LI, LU, EI, and EU.]
5.4 Summary
The SPLASH programs can be divided into two categories based on their synchronization and sharing behavior. The first category is characterized by heavy use of barrier synchronization. This category includes the MP3D and Water programs. These programs performed poorly with invalidate protocols and large page sizes. Although barriers result in nearly the same number of messages under both eager and lazy protocols, even these programs have enough lock synchronization for the lazy protocols to reduce the number of messages and the amount of data exchanged.
The second category is characterized by migratory access to data that is controlled by locks. This category includes LocusRoute, Cholesky, and Pthor. In Cholesky and Pthor, the locks protect centralized work queues, while the locks in LocusRoute protect access to individual cost array elements. The use of locks tends to cause the sharing patterns to closely follow synchronization. Since the lazy protocols move data according to synchronization, they handle this type of synchronization much better than eager protocols.
LU performed well for both categories of programs. In contrast, EU often performed worse than the invalidate protocols, because it does not handle migratory data very well. LU sends fewer messages than EU for migratory data because updates are only sent to the next processor to acquire the lock that controls access to the data.
In all of the programs, the number of processors sharing a page is increased by false sharing. Multiple-writer RC protocols reduce the impact of false sharing by permitting ordinary accesses to a page by different processors to be performed concurrently. However, the eager protocols still perform communication at synchronization points between processors sharing a page, but not the data within the page. Lazy protocols eliminate this communication, because processors that falsely share data are unlikely to be causally related. This observation is consistent with the results of our simulations.
6 Related Work
Ivy [12] was the first page-based distributed shared memory system. The shared memory implemented by Ivy is sequentially consistent, and does not allow multiple writers.
Clouds [15] uses program-based segments rather than pages as the granularity of consistency. In addition, Clouds permits segments to be locked down at a single processor to prevent ping-ponging.
Release consistency was introduced by Gharachorloo et al. [8]. It is a refinement of weak consistency, defined by Dubois and Scheurich [7]. The DASH multiprocessor takes advantage of release consistency by pipelining remote memory accesses [11]. Pipelining reduces the impact of remote memory access latency on the processor.
Munin [5] was the first software distributed shared memory system to use release consistency. Munin's implementation of release consistency merges updates at release time, rather than pipelining them, in order to reduce the number of messages transferred between processors. Munin uses multiple consistency protocols to further reduce the number of messages.
Ahamad et al. defined a relaxed memory model called causal memory [3]. Causal memory differs from RC because conflicting pairs of ordinary memory accesses establish causal relationships. In contrast, under RC, only special memory accesses establish causal relationships.
Entry consistency (EC), defined by Bershad and Zekauskas [4], is another related relaxed memory model. EC differs from RC because it requires all shared data to be explicitly associated with some synchronization variable. As a result, when a processor acquires a synchronization variable, an EC implementation only needs to propagate the shared data associated with the synchronization variable. EC, however, requires the programmer to insert additional synchronization in shared memory programs, such as the SPLASH benchmarks, to execute correctly on an EC memory. Typically, RC does not require additional synchronization.
7 Conclusions
The performance of software DSMs is very sensitive to the number of messages and the amount of data exchanged to create the shared memory abstraction. We have described a new algorithm for implementing release consistency, lazy release consistency, aimed at reducing both the number of messages and the amount of data exchanged. Lazy release consistency tracks the causal dependencies between writes, acquires, and releases, allowing it to propagate writes lazily, only when they are needed.
We have used trace-driven simulation to compare lazy release consistency to an eager algorithm for implementing release consistency, based on Munin's write-shared protocol. Traces were collected from the programs in the SPLASH benchmark suite, and both update and invalidate protocols were simulated for lazy and eager RC. The simulations confirm that the number of messages and the amount of data exchanged are generally smaller for the lazy algorithm, especially for programs that exhibit false sharing and make extensive use of locks. Further work will include an implementation of lazy release consistency to assess the runtime cost of the algorithm.
References
[1] S. Adve and M. Hill. Weak ordering: A new definition. In Proceedings of the Annual International Symposium on Computer Architecture, May.
[2] S. V. Adve and M. D. Hill. A unified formalization of four shared-memory models. Technical Report CS, University of Wisconsin, Madison, September.
[3] Mustaque Ahamad, Phillip W. Hutto, and Ranjit John. Implementing and programming causal distributed shared memory. In Proceedings of the International Conference on Distributed Computing Systems, May.
[4] B. N. Bershad and M. J. Zekauskas. Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical Report CMU-CS, Carnegie Mellon University, September.
[5] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the ACM Symposium on Operating Systems Principles, October.
[6] H. Davis, S. Goldschmidt, and J. L. Hennessy. Tango: A multiprocessor simulation and tracing system. Technical Report CSL-TR, Stanford University.
[7] M. Dubois and C. Scheurich. Memory access dependencies in shared-memory multiprocessors. IEEE Transactions on Computers, June.
[8] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the Annual International Symposium on Computer Architecture, Seattle, Washington, May.
[9] J. R. Goodman. Cache consistency and sequential consistency. Technical Report, SCI Committee, March 1989.
[10] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, September.
[11] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the Annual International Symposium on Computer Architecture, May.
[12] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, November 1989.
[13] R. J. Lipton and J. S. Sandberg. PRAM: A scalable shared memory. Technical Report CS-TR, Princeton University, September 1988.
[14] F. Mattern. Virtual time and global states of distributed systems. In Michel Cosnard, Yves Robert, Patrice Quinton, and Michel Raynal, editors, Parallel & Distributed Algorithms. Elsevier Science Publishers, Amsterdam, 1989.
[15] U. Ramachandran, M. Ahamad, and Y. A. Khalidi. Unifying synchronization and data transfer in maintaining coherence of distributed shared memory. Technical Report GIT-CS, Georgia Institute of Technology, June 1988.
[16] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared-memory. Technical Report CSL-TR, Stanford University, April.
[17] W.-D. Weber and A. Gupta. Analysis of cache invalidation patterns in multiprocessors. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, April 1989.