Software vs. Hardware Shared Memory Implementation: A Case Study by Cox, Alan L. et al.
Software Versus Hardware Shared Memory
Implementation A Case Study
Alan L  Cox Sandhya Dwarkadas Pete Keleher
Honghui Lu Ramakrishnan Rajamony and Willy Zwaenepoel
Department of Computer Science
Rice University
Abstract
We compare the performance of software supported
shared memory on a general purpose network to
hardware supported shared memory on a dedicated
interconnect
Up to eight processors our results are based on
the execution of a set of application programs on a
SGI D multiprocessor and on TreadMarks a dis 
tributed shared memory system that runs on a Fore
ATM LAN of DECstation s Since the DEC 
station and the D use the same processor pri 
mary cache and compiler the shared memory imple 
mentation is the principal di	erence between the sys 
tems Our results show that TreadMarks performs
comparably to the D for applications with mod 
erate amounts of synchronization but the di	erence
in performance grows as the synchronization frequency
increases For applications that require a large amount
of memory bandwidth TreadMarks can perform bet 
ter than the SGI D
Beyond eight processors our results are based on
execution driven simulation Speci
cally we compare
a software implementation on a general purpose net 
work of uniprocessor nodes a hardware implementa 
tion using a directory based protocol on a dedicated
interconnect and a combined implementation using
software to provide shared memory between multi 
processor nodes with hardware implementing shared
memory within a node For the modest size of the
problems that we can simulate the hardware imple 
mentation scales well and the software implementation
scales poorly The combined approach delivers perfor 
mance close to that of the hardware implementation
for applications with small to moderate synchroniza 
tion rates and good locality Reductions in communi 
This research was supported in part by the National Sci 
ence Foundation under Grants CCR  CCR 
CDA  and CDA 	 and by the Texas Advanced
Technology Program under Grants  and 

cation overhead improve the performance of the soft 
ware and the combined approach but synchronization
remains a bottleneck
1 Introduction
Over the last decade, considerable effort has been spent on software
implementations of shared memory on general-purpose networks. We are,
however, unaware of any study comparing the performance of any of these
systems to the performance of a hardware implementation of shared
memory on a dedicated interconnect. Several studies have compared
software to hardware cache coherence mechanisms, but these systems
still rely on hardware-initiated data movement and a dedicated
interconnect. In this paper, we compare a shared-memory implementation
that runs entirely in software on a general-purpose network of
computers to a hardware implementation on a dedicated interconnect.
Up to eight processors our results are based on an
experimental comparison of a software and a hardware
implementation Speci
cally we compare the Tread 
Marks software distributed shared memory system 
running on a Mbitsecond ATM network connect 
ing  DECstation s to an  processor Silicon
Graphics D These con
gurations have identical
processors clock speeds primary caches compilers
and parallel programming interfaces the ANL PAR 
MACS macros  The similarity between the two
platforms from the neck up avoids many distinc 
tions that often blur comparative studies and allows
us to focus on the di	erences caused by the shared 
memory implementation TreadMarks supports lazy
release consistency  and is implemented as a user 
level library on top of Ultrix  The SGI D
provides processor consistency using a bus snooping
protocol 
We use four applications in our comparison ILINK
SOR TSP and Water TSP uses only locks for syn 
chronization SOR and ILINK use only barriers and
Water uses both For ILINK SOR and TSP we
present results for two di	erent sets of input data For
Water the results are largely independent of the in 
put Instead we present the results for a modi
ed ver 
sion M Water that reduces the amount of synchro 
nization With the exception of SOR better speedups
are obtained on the D There is a strong cor 
relation between the synchronization frequency of the
application and the di	erence in speedup between the
D and TreadMarks With higher synchroniza 
tion frequencies the large latencies of the software im 
plementation become more of a limiting factor SOR
however gets better speedup on TreadMarks than on
the D because this application requires large
memory bandwidths
Beyond eight processors our results are based on
execution driven simulations of systems with up to 
processors We compare three alternative designs i
an all software AS approach connecting  unipro 
cessor machines with a general purpose network ii
an all hardware AH approach connecting  unipro 
cessor nodes with a crossbar network and using a
directory based hardware cache coherence protocol
and iii a hardware software HS approach connect 
ing  bus based multiprocessor nodes with a general 
purpose network and using the TreadMarks software
DSM system The HS approach is appealing from a
cost standpoint because small bus based shared mem 
ory workstations are likely to become cheaper than a
set of uniprocessor workstations with an equal num 
ber of processors The HS approach also avoids the
complexity of directory based cache controllers
We use SOR TSP and M Water in our compari 
son Simulation times for available ILINK inputs were
prohibitively high For all three applications the AS
approach scales poorly compared to the other two For
SOR and TSP performance of AH and HS is com 
parable For M Water the AH approach performs
signi
cantly better because each processor accesses a
majority of the shared data during each step of the
computation and because of the frequency of synchro 
nization Anticipated improvements in network inter 
face technology and attendant decreases in communi 
cation software overhead reduce the performance gap
between the di	erent implementations
The rest of this paper is organized as follows. Section 2 details the
comparison between the SGI 4D/480 and TreadMarks. Section 3 presents
simulation results comparing the AS, AH, and HS architectures for a
larger number of processors. Section 4 examines related work. Section 5
presents our conclusions.
 SGI D versus TreadMarks
  TreadMarks
In this section we briey describe the release consis 
tency RC model  and the lazy release consistency
LRC implementation  used by TreadMarks Fur 
ther details on TreadMarks may be found in Keleher
et al 
RC is a relaxed memory consistency model. In RC, ordinary shared-memory
accesses are distinguished from synchronization accesses, with the
latter category subdivided into acquire and release accesses. Acquire
and release accesses correspond roughly to the conventional
synchronization operations on a lock, but other synchronization
mechanisms can be built on this model as well. Essentially, RC allows
the effects of ordinary shared-memory accesses to be delayed until a
subsequent release by the same processor is performed.
The LRC algorithm used by TreadMarks delays the propagation of
modifications to a processor until that processor executes an acquire.
To do so, LRC uses the happened-before-1 partial order. The
happened-before-1 partial order is the union of the total processor
order of the memory accesses on each individual processor and the
partial order of release-acquire pairs. Vector timestamps are used to
represent the partial order. When a processor executes an acquire, it
sends its current vector timestamp in the acquire message. The last
releaser then piggybacks on its response a set of write notices. These
write notices describe the shared data modifications that precede the
acquire according to the partial order. The acquiring processor then
determines the pages for which the incoming write notices contain
vector timestamps larger than the timestamp of its copy of that page in
memory. For these pages, the shared data modifications described in the
write notices must be reflected in the acquirer's copy. To accomplish
this, TreadMarks invalidates the copies.
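
The acquire step described above can be illustrated with a small Python sketch. This is a simplified model and not the TreadMarks implementation: the names `WriteNotice`, `missing_notices`, and `apply_acquire` are hypothetical, each write notice is assumed to carry the writer's identity and a per-writer interval index, and the per-page timestamp comparison in the text is reduced to comparing that index against the acquirer's vector timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteNotice:
    page: int      # page that was modified
    writer: int    # processor that made the modification
    interval: int  # writer's interval index (assumed encoding of its timestamp entry)


def missing_notices(acquirer_vt, releaser_log):
    """Select the write notices that the acquirer's vector timestamp does not cover."""
    return [wn for wn in releaser_log if wn.interval > acquirer_vt[wn.writer]]


def apply_acquire(acquirer_vt, valid_pages, notices):
    """Invalidate the pages named in the notices and return the advanced timestamp."""
    new_vt = list(acquirer_vt)
    for wn in notices:
        valid_pages.discard(wn.page)  # the page is re-validated on the next access fault
        new_vt[wn.writer] = max(new_vt[wn.writer], wn.interval)
    return new_vt
```

In this sketch the releaser would call `missing_notices` with the timestamp received in the acquire message, and the acquirer would call `apply_acquire` on the reply. No data moves at acquire time; only invalidations are applied.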
On an access fault a page is validated by bringing
in the necessary modi
cations to the local copy in the
form of dis A di	 is a run length encoding of the
changes made to a single virtual memory page The
faulting processor uses the vector timestamps associ 
ated with its copy of the page and the write notices it
received for that page to identify the necessary di	s
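
As an illustration of what a run-length-encoded diff might look like, the following Python sketch compares a page against an earlier copy of the same page and records only the runs of bytes that differ. Keeping an unmodified copy to compare against is an assumption of this sketch, and `make_diff` and `apply_diff` are hypothetical names rather than TreadMarks functions.

```python
def make_diff(old: bytes, new: bytes):
    """Return a list of (offset, changed_bytes) runs where `new` differs from `old`."""
    assert len(old) == len(new)
    runs, i, n = [], 0, len(new)
    while i < n:
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < n and old[i] != new[i]:
            i += 1
        runs.append((start, new[start:i]))  # one run of consecutive changed bytes
    return runs


def apply_diff(page: bytearray, runs):
    """Patch a local copy of the page in place with the runs from a diff."""
    for offset, data in runs:
        page[offset:offset + len(data)] = data
```

Note that a store which writes the value already present produces no run, so a diff built this way carries only bytes whose value actually changed.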
2.2 Experimental Platform
The system used to evaluate TreadMarks consists of eight DECstation
workstations, each with a  MHz MIPS R processor, a  Kbyte primary
instruction cache, a  Kbyte primary data cache, and  Mbytes of memory.
The data cache is write-through, with a write buffer connecting it to
main memory. The workstations are connected to a high-speed ATM network
using a Fore Systems TCA  network adapter card supporting communication
at  Mbits/second. In practice, however, user-to-user bandwidth is
limited to  Mbits/second. The ATM interface connects point-to-point to
a Fore Systems ASX  ATM switch, providing a high aggregate bandwidth
because of the capability for simultaneous full-speed communication
between disjoint workstation pairs. The workstations run the Ultrix
version  operating system. TreadMarks is implemented as a user-level
library linked in with the application program. No kernel modifications
are necessary. TreadMarks uses conventional Unix socket, mprotect, and
signal handling interfaces to implement communication and memory
management. The minimum time for a remote lock acquisition is
 milliseconds; the time for an 8-processor barrier is  milliseconds.
The shared memory multiprocessor used in the
comparison is a Silicon Graphics D with 
Mhz MIPS R processors Each processor has
a  Kbyte primary instruction cache and a  Kbyte
primary data cache The primary data cache imple 
ments a write through policy to a write bu	er In ad 
dition each processor has a  Mbyte secondary cache
implementing a write back policy The secondary
caches and the main memory  Mbytes are con 
nected via a  Mhz  bit wide shared bus Cache
coherence between the secondary caches is maintained
using the Illinois protocol The presence of the write
bu	er between the primary and the secondary cache
however makes the memory processor consistent The
SGI runs the IRIX Release  System V operating
system
An important aspect of our evaluation is that the DECstation and the
SGI 4D/480 have the same type of processor running at the same clock
speed, the same size primary instruction and data caches, and a write
buffer from the primary cache to the next level in the memory hierarchy
(main memory on the DECstation, the secondary cache on the SGI). For
both machines, we use the same compiler (gcc, with optimization
enabled), and the program sources are identical, using the PARMACS
macros. The only significant difference between the two parallel
computers is the method used to implement shared memory: dedicated
hardware versus software on message-passing hardware.
Single-processor performance on the two machines depends on the size of
the program's working set. Both machines are the same speed when
executing entirely in the primary cache. If the working set fits in the
secondary cache on the 4D/480, a single 4D/480 processor is  to  slower
than a DECstation because the main memory of the DECstation is slightly
faster than the secondary cache of the 4D/480 processor. The 4D/480's
secondary cache is clocked at the same speed as the backplane bus,
 MHz. If the working set is larger than the secondary cache size, the
4D/480 slows down significantly.
2.3 Application Suite
We used four programs for our comparison: ILINK, SOR, TSP, and Water.
ILINK is a widely used genetic linkage analysis program that locates
specific disease genes on chromosomes. We ran ILINK with two different
inputs, CLP and BAD, both corresponding to real data sets used in
disease gene location. The CLP and BAD inputs show the best and the
worst speedups, respectively, among the inputs that are available to us.
Red Black Successive Over Relaxation SOR is a
method for solving partial di	erential equations The
SOR program divides the matrix into roughly equal
size bands of consecutive rows assigning each band to
a di	erent processor Communication occurs across
the boundary between bands We ran SOR for a 
iterations on a    and a    matrix
We chose the   problem size because it does
not cause paging on a single DECstation and it 
ts
within the secondary cache of the D when run 
ning on  processors The    run is included
to assess the e	ect of changing the communication to
computation ratio
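
The banded decomposition can be sketched in a few lines of Python. The function names `band_ranges` and `boundary_rows` are hypothetical, and the sketch assumes at least one row per processor; it only shows which rows end up shared between neighbors, not the relaxation itself.

```python
def band_ranges(rows: int, procs: int):
    """Split `rows` consecutive rows into `procs` roughly equal bands [lo, hi)."""
    base, extra = divmod(rows, procs)
    ranges, lo = [], 0
    for p in range(procs):
        hi = lo + base + (1 if p < extra else 0)  # the first `extra` bands get one more row
        ranges.append((lo, hi))
        lo = hi
    return ranges


def boundary_rows(ranges):
    """Rows adjacent to another processor's band; only these are read by a neighbor."""
    shared = set()
    for lo, hi in ranges[:-1]:
        shared.add(hi - 1)  # last row of this band, read by the next band
        shared.add(hi)      # first row of the next band, read by this band
    return sorted(shared)
```

For a fixed number of processors, the number of boundary rows stays constant while the number of rows per band grows with the matrix, which is why the matrix size changes the communication-to-computation ratio.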
TSP solves the traveling salesman problem using a branch-and-bound
algorithm. The program has a shared, global queue of partial tours.
Each process gets a partial tour from the queue, extends the tour, and
returns the results back to the queue. We use   and  city problems as
input. Although the program exhibits nondeterministic behavior,
occasionally resulting in super-linear speedup, executions with the
same input produce repeatable results.
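
A sequential Python sketch of the queue-based branch-and-bound structure is given below. It is a stand-in for illustration only: the parallel program shares the queue and the current minimum among processes, whereas this sketch uses a single worker, a plain list as the queue, and the hypothetical function name `tsp_branch_and_bound`. It assumes non-negative distances.

```python
import math


def tsp_branch_and_bound(dist):
    """Length of the shortest tour that starts and ends at city 0."""
    n = len(dist)
    best = math.inf
    queue = [(0, [0])]  # entries are (length so far, partial tour)
    while queue:
        length, tour = queue.pop()  # get a partial tour from the queue
        if length >= best:          # prune: cannot beat the current minimum
            continue
        if len(tour) == n:          # complete tour: close it and update the bound
            best = min(best, length + dist[tour[-1]][0])
            continue
        for city in range(n):       # extend the tour and return the results
            if city not in tour:
                queue.append((length + dist[tour[-1]][city], tour + [city]))
    return best
```

The pruning test reads the current minimum without modifying it, which is the kind of access that matters for the consistency discussion of TSP later in the paper.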
Water from the SPLASH suite  is a molecular
dynamics simulation The original Water program ob 
tains a lock on the record representing a molecule each
Program DEC TreadMarks SGI
ILINK CLP   
ILINK BAD   
SOR      
SOR      
TSP    
TSP    
Water     
M Water     
Table  Single processor execution times
time it updates the contents of the record We modi 

ed Water such that each processor instead uses a lo 
cal variable to accumulate its updates to a molecules
record during an iteration At the end of the itera 
tion it then acquires a lock on each molecule that it
needs to update and applies the accumulated updates
at once The number of lock acquires and releases
for each processor in M Water is thus equal to the
number of molecules that processor updates In the
original program it is equal to the number of updates
that processor performs a much larger quantity We
present the results for Water and M Water for a run
with  molecules for  time steps The results for
Water were largely independent of the data set chosen
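
The difference between the original locking scheme in Water and the M-Water modification can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Molecule` class, its `force` field, and the `acquires` counter are hypothetical and exist only to make the number of lock acquisitions visible.

```python
import threading


class Molecule:
    def __init__(self):
        self.force = 0.0
        self.lock = threading.Lock()
        self.acquires = 0  # counts lock acquisitions, for illustration only


def water_step(molecules, updates):
    """Original scheme: take the molecule's lock on every update."""
    for idx, delta in updates:
        m = molecules[idx]
        with m.lock:
            m.acquires += 1
            m.force += delta


def m_water_step(molecules, updates):
    """Modified scheme: accumulate locally, then lock each molecule once."""
    local = {}
    for idx, delta in updates:
        local[idx] = local.get(idx, 0.0) + delta
    for idx, total in local.items():
        m = molecules[idx]
        with m.lock:
            m.acquires += 1
            m.force += total
```

With the same list of updates, both functions leave the same accumulated values, but `m_water_step` acquires each lock once per molecule touched rather than once per update.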
2.4 Results
Figures 1 to 8 present the speedups achieved for ILINK, SOR, TSP,
Water, and M-Water, both on TreadMarks and the 4D/480. The TreadMarks
speedups are relative to the single-processor DECstation run times
without TreadMarks. Table 1 presents the single-processor execution
times on both machines, including the DECstation with and without
TreadMarks. As can be seen from this table, the presence of TreadMarks
has almost no effect on single-processor execution times. Finally,
Table 2 details the off-node synchronization rates, the number of
messages, and the amount of data movement per second on TreadMarks for
each of the applications on 8 processors. Sections 2.4.1 to 2.4.4
discuss the results for each application in detail.
2.4.1 ILINK
Figures 1 and 2 show ILINK's speedup for the CLP and BAD inputs. Among
the inputs that are available to us, the CLP and BAD inputs show the
best and the worst speedups, and the smallest and largest difference in
speedup between the 4D/480 and TreadMarks ( vs  and  vs ).

Figure 1: ILINK, CLP (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 2: ILINK, BAD (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 3: SOR,    matrix (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 4: SOR,    matrix (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 5: TSP,  cities (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 6: TSP,  cities (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 7: Water,  molecules and  steps (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)
Figure 8: M-Water,  molecules and  steps (speedup versus number of processors, 1 to 8, for TreadMarks and the SGI 4D/480)

Table 2: 8-processor TreadMarks execution statistics
Columns: ILINK BAD, ILINK CLP, SOR   , SOR   , TSP  city, TSP  city, Water, M-Water
Barriers/second
Remote locks/second
Messages/second
Kbytes/second
ILINK achieves less than linear speedup on both the 4D/480 and
TreadMarks because of a load-balancing problem inherent to the nature
of the algorithm. It is not possible to predict in advance whether the
set of iterations distributed to the processors will result in the same
amount of work on each processor without significant computation and
communication.
The 4D/480 outperforms TreadMarks because of the large amount of
communication. The communication rate for the CLP input set is
 Kbytes/second and  messages/second on 8 processors, compared to
 Kbytes/second and  messages/second for the BAD input set, hence the
better speedups achieved for CLP.
2.4.2 SOR
Figures 3 and 4 show SOR's speedup for  iterations of    and
   problems. We excluded the first iteration of SOR from the data and
message rates in order to avoid having the initial data distribution
skew our results. Of the four applications used, SOR is the only one
for which there is a sizable difference in single-processor execution
time between TreadMarks and the 4D/480. TreadMarks is approximately
 faster on a single processor because both problem sizes exceed the
size of the secondary cache on the SGI.
In addition to lower single-processor execution times, better speedups
are achieved on TreadMarks. The difference is partly due to the way in
which TreadMarks communicates updates to shared memory. Points at the
edge of the matrix are initialized to values that remain fixed
throughout the computation. Points in the interior of the matrix
default to  . During the early iterations, the points at the interior
of the array are recomputed and stored to memory, but their value
remains the same. Only the points near the edge change value. On the
4D/480, the hardware cache coherence protocol updates the memory
regardless of the fact that the values remain the same. TreadMarks,
however, only communicates the points that have changed value because
diffs (see Section 2.1) are computed from the contents of a page.
Consequently, the amount of data movement by TreadMarks is
significantly less than the amount of data movement by the 4D/480. The
estimated data movement for  processors by the 4D/480 after the initial
data migration between processors is  Kbytes, whereas the actual data
movement by TreadMarks is  Kbytes.
To eliminate this effect, we initialized the matrix such that every
point changes value at every iteration, equalizing the data movement by
the 4D/480 and TreadMarks. Even in this modified version, the speedup
is better on TreadMarks than on the 4D/480. We attribute this result to
the fact that most communication in SOR occurs at the barriers and
between neighbors. On the ATM network, this communication can occur in
parallel. On the 4D/480, it causes contention for the cache tags and
the bus.
2.4.3 TSP
Figures 5 and 6 show TSP's speedup for solving a  city and an  city
problem. Branch-and-bound algorithms can achieve super-linear speedup
if the parallel version finds a good approximation early on, allowing
it to prune more of the search tree than the sequential version. An
example of such super-linear speedup can be seen on the 4D/480 for the
 city problem. More important than the absolute values of the speedups
is the comparison between the speedups achieved on the two systems. We
see better performance on the 4D/480 than on TreadMarks:  vs  for the
 city problem, and  vs  for the  city problem. The difference is
slightly larger for the  city problem because of the increased
synchronization and communication rates (see Table 2).
The performance on TreadMarks suffers from the fact that TSP is not a
properly labeled program. Although updates to the current minimum tour
length are synchronized, read accesses are not. Since TreadMarks
updates cached values only on an acquire, a processor may read an old
value of the current minimum. The execution remains correct, but the
work performed by the processor may be redundant since a better tour
has already been found elsewhere. On the 4D/480, this is unlikely to
occur since the cache consistency mechanism invalidates cached copies
of the minimum when it is updated. By propagating the bound earlier,
the 4D/480 reduces the amount of work each processor performs, leading
to a better speedup. Adding synchronization around the read accesses
would hurt performance, given the large number of such accesses.
To eliminate this effect, we modified TSP to perform an eager lock
release instead of a lazy lock release after updating the lower bound
value. With an eager release, the modified values are updated at the
release, rather than at a subsequent acquire. The speedup of TSP
improved from  to  on  processors, vs  on the 4D/480. The remaining
differences between the DSM and the SGI performance can be explained by
faster lock acquisition on the SGI, compounded with the
nondeterministic effect of picking up redundant work due to the slight
delay in propagating the bound.
2.4.4 Water
Figure 7 shows Water's speedup executing  steps on  molecules.
TreadMarks gets no speedup, except on  processors, because the high
rate of synchronization ( remote lock acquires/second) causes many
messages ( messages/second).
Figure 8 shows M-Water's speedup executing  steps on  molecules. On the
4D/480, M-Water's speedup is virtually identical to that of Water. On
TreadMarks, however, there is a marked performance improvement. We
obtain a speedup of  using  processors. Compared to Water, the number
of messages/second drops from  to .
Part of the high cost of message transmission is due to the user-level
implementation of TreadMarks, in particular the need to trap into the
kernel to send and receive messages. We have implemented TreadMarks
inside the Ultrix kernel in order to assess the tradeoffs between a
user-level and a kernel-level implementation. In the kernel-level
implementation, the minimum time to acquire a lock drops from  to
 milliseconds, and the time for an 8-processor barrier drops from  to
 milliseconds. For ILINK, SOR, and TSP, the differences between the
kernel-level and user-level implementations are minimal, reflecting the
low communication rates in these applications. For M-Water, however,
the differences are substantial. Speedup on  processors increases from
 for the user-level implementation to  for the kernel-level
implementation, compared to  for the 4D/480.
  Summary
The relative magnitudes of the differences in speedup between
TreadMarks and the 4D/480 for ILINK, TSP, Water, and M-Water roughly
correlate with the differences in the synchronization rates. For TSP,
Water, and M-Water, which are primarily lock-based, the difference in
speedup is closely related to the frequency with which off-node locks
are acquired. On  processors, the difference in speedup is  for Water,
with  remote lock accesses per second;  for M-Water;  for the  city
TSP; and  for the  city TSP. In addition, for TSP, the 4D/480 performs
better because the eager nature of the cache consistency protocol
reduces the amount of redundant work performed by individual
processors. For ILINK, which uses barriers, the difference in speedup
can be explained by the barrier synchronization frequency: a difference
of  for the BAD data set, with  barriers per second, vs a difference of
 for CLP, with  barriers per second. For SOR, the larger memory
bandwidth available in TreadMarks results in better speedups. Dual
cache tags and a faster bus relative to the speed of the processors are
necessary to overcome the bandwidth limitation on the SGI.
The ATM LANs longer latency makes synchro 
nization more expensive on TreadMarks than on the
D Moving the implementation inside the kernel
as we did is only one of several mechanisms that can
be used to reduce message latency Our results for
M Water demonstrate the possible performance im 
provement
3 Comparison of Larger Systems
In this section we extend our results to larger numbers of processors.
The software approach scales, at least conceptually, to a larger number
of processors without modification. The hardware approach, however,
becomes more complex once the number of processors exceeds what can
reasonably be supported by a single bus. The processor interconnect
instead becomes a mesh or a crossbar with one or more processors at the
nodes, and the cache controllers implement a directory-based cache
coherence protocol. We also model a third architecture that consists of
a number of bus-based multiprocessors connected by a general-purpose
network. Each node has sufficient bus bandwidth to avoid contention.
Conventional bus-snooping hardware enforces coherence between the
processors within a node. Coherence between different nodes is
implemented in software. We will refer to these three architectures as
the All-Software (AS), All-Hardware (AH), and Hardware-Software (HS)
approaches.
The HS approach appears promising both in terms of cost and complexity.
Compared to the AS approach, bus-based multiprocessors with a small
number of processors (N) are cheaper than N comparable uniprocessor
workstations. Furthermore, the cost of the interconnection hardware is
reduced by roughly a factor of N. Compared to the AH approach,
commodity parts can be used, reducing the cost and complexity of the
design.
3.1 Simulation Models
We simulated the three architectures using an execution-driven
simulator. Instead of the DECstation and SGI 4D/480, we base our models
on leading-edge technology. All of the architectural models use RISC
processors with a  MHz clock,  Kbyte direct-mapped caches with a block
size of  bytes, and main memory sufficient to hold the simulated
problem without paging. We simulate up to 64 processors for each
architecture.
In both the AH and the AS models each node has
one processor and a local memory module A cache
miss satis
ed by local memory takes  processor cy 
cles In the HS model each node has  processors
connected by a  bit wide split transaction bus op 
erating at  MHz A cache miss satis
ed by local
memory takes  processor cycles slightly longer than
the AH and the AS models because of bus overhead
In the AH model the nodes are connected by a
crossbar network with point to point bandwidth of
 Mbytessecond and a latency of  nanoseconds
We used a crossbar in order to minimize the e	ect of
network contention on our results The point to point
bandwidth is the same as the Intel Paragons network
Cache coherence is maintained using a directory based
protocol A cache miss satis
ed by remote memory
takes  to  processor cycles depending on the
blocks location and whether it has been modi
ed
These cycle counts are similar to those for the Stan 
ford DASH  and FLASH  multiprocessors
In both the AS and the HS models the general 
purpose network is an ATM switch with a point to 
point bandwidth of  Mbitsecond and a latency
of  microsecond Memory consistency between the
nodes is maintained using the TreadMarks LRC in 
validate protocol see Section  The simulations
account for the wire time contention for the network
links and the software overhead of entering the ker 
nel to send or receive messages including data copying
   message size in words processor cycles
calling a user level handler for page faults and incom 
ing messages  processor cycles and creating a
di	  words per page processor cycles The values
are based on measurements of the TreadMarks imple 
mentation on the DECstation  described in
Section 
For the HS approach all of the processors within a
node are treated as one by the DSM system We as 
sume that cache and TLB coherency mechanisms will
ensure that processors within a node see up to date
values Multiple faults to the same page are merged
by the DSM system Similarly modi
cations to the
same page made by processors on the same node are
merged into a single di	 Synchronization is imple 
mented through a combination of shared memory and
message passing reecting the hierarchical structure
of the machine For barriers each processor updates
a local counter until the last processor on the node
has reached the barrier The last processor sends the
arrival message to the manager When the last ar 
rival message arrives at the manager it issues a de 
parture message to each node Similarly locks are
implemented using a token The token is held at one
node at a time In order to acquire a lock a processor
must 
rst bring the token to its node If the token
already resides at the node no messages are required
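
The hierarchical barrier and token-based lock described above can be modeled with a short Python sketch that only counts messages. The class and function names (`NodeBarrier`, `run_barrier`, `TokenLock`) are hypothetical, and two simplifications are assumed: the manager's own node is counted like any other node, and a remote token fetch is counted as a single event rather than as a specific number of messages.

```python
class NodeBarrier:
    """Per-node arrival counter; only the last local arrival notifies the manager."""

    def __init__(self, procs_per_node):
        self.procs_per_node = procs_per_node
        self.arrived = 0

    def arrive(self):
        """Return True if this arrival is the node's last one for this barrier."""
        self.arrived += 1
        if self.arrived == self.procs_per_node:
            self.arrived = 0
            return True
        return False


def run_barrier(num_nodes, procs_per_node):
    """Count (arrival, departure) messages for one barrier episode."""
    nodes = [NodeBarrier(procs_per_node) for _ in range(num_nodes)]
    arrivals = sum(1 for node in nodes for _ in range(procs_per_node) if node.arrive())
    departures = num_nodes if arrivals == num_nodes else 0
    return arrivals, departures


class TokenLock:
    """Lock token that lives at one node at a time."""

    def __init__(self, home_node):
        self.holder = home_node
        self.remote_fetches = 0

    def acquire(self, node):
        if node != self.holder:  # token must be brought to the acquirer's node
            self.remote_fetches += 1
            self.holder = node
```

In this model the number of barrier messages depends on the number of nodes rather than the number of processors, and consecutive acquires from the same node incur no remote fetch.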
3.2 Results
We simulated SOR, TSP, and M-Water. Excessively long simulation times
prevented us from including simulation results for ILINK. Figures 9 to
11 report the speedups achieved on the three different architectures.
Since the uniprocessor execution times are roughly identical for all
three architectures, the execution times are omitted. Figures 12 and 13
present the message and data movement totals for AS and HS. The totals
are presented relative to the AS numbers. The message totals are broken
down into miss and synchronization messages, and the data totals are
broken down into miss, consistency, and header data (see Section 2.1).
Sections 3.2.1 to 3.2.3 discuss the observed performance of the
individual applications. Section 3.2.4 discusses the effect of reducing
the software overhead for the AS and HS architectures.
Figure  SOR   
(speedup vs. processors, 0 to 64; curves: AH, HS, AS)
Figure  TSP  cities
(speedup vs. processors, 0 to 64; curves: AH, HS, AS)
Figure  M-Water  molecules and  steps
(speedup vs. processors, 0 to 64; curves: AH, HS, AS)
Figure  Total messages  processors
(percentage of total AS messages for SOR, TSP, and
M-Water; HS and AS, each split into miss and sync
messages)
Figure  Total data  processors
(percentage of total AS data for SOR, TSP, and M-Water;
HS and AS, each split into miss, consistency, and
message header data)
Figure  Varying software overheads for AS SOR   
(speedup vs. processors, 0 to 64; curves: 50/8, 500/8,
500/28, 5000/28)
Figure  Varying software overheads for AS M-Water
 molecules and  steps
(speedup vs. processors, 0 to 64; curves: 50/8, 500/8,
500/28, 5000/28)
  SOR
Figure  presents speedups for the SOR program for
a    matrix. Since we only simulate a small
number of iterations, we begin collecting statistics
from the second iteration in order to prevent cold-start
misses from dominating our results. Linear speedup is
achieved on AH and HS, while the performance of AS
is sub-linear due to the high communication cost. SOR
performs mainly nearest-neighbor communication and
thus takes advantage of the hierarchical nature of the
HS architecture. The only processors to incur a high
penalty for misses are the edge processors that share
data with processors that are off-node, and hence this
program incurs little extra overhead on HS in
comparison to AH. This conclusion is supported by the
observation that the number of messages for the 
processor execution on HS is  of the number of
messages for the  processor AS execution (see
Figure ).
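To illustrate why so few SOR misses leave the node on HS,
the sketch below counts neighbor boundaries under a
one-dimensional band partition. The partitioning, the
function name `boundary_counts`, and the choice of 64
processors grouped 8 per node are assumptions made for the
example, not parameters taken from the simulation.

```python
def boundary_counts(num_procs, procs_per_node):
    """Count neighbor boundaries in a 1-D band partition.

    Processor i shares one boundary with processor i + 1. A boundary
    is off-node when the two processors sit on different nodes.
    """
    total = num_procs - 1
    off_node = sum(
        1
        for i in range(num_procs - 1)
        if i // procs_per_node != (i + 1) // procs_per_node
    )
    return total, off_node


total, off_node = boundary_counts(64, 8)
print(total, off_node)  # 63 7
```

Under these assumptions only 7 of the 63 boundaries cross a
node, so most boundary sharing stays within a node, where
hardware implements shared memory.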
   TSP
Figure  presents speedups for the TSP program with
a  city input. This program has a very high
computation-to-communication ratio. However, as the
number of processors increases, this ratio decreases
enough for the high latency of communication in the AS
architecture to become a bottleneck. Figure  shows that
the number of messages for the HS architecture is less
than  that for the AS architecture. The reduction
is not  fold because the next processor to access the
queue is more likely to be from another node.
Figure  shows that the amount of data movement by
HS is about  that for AS. The  fold reduction in
data movement is a result of HS coalescing changes
from different processors on a node into a single diff
(see Section ).
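The effect of coalescing can be illustrated with a toy page
of eight words. The sketch assumes a diff is a map from word
offset to new value, computed against an unmodified copy of
the page. The helper `make_diff` and the write pattern are
hypothetical and only meant to show why one per-node diff
can carry less data than several per-processor diffs.

```python
def make_diff(original, page):
    """Return {offset: new_value} for every word that changed."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(original, page))
        if old != new
    }


original = [0] * 8

# AS-like case: each processor writes its own copy of the page
# and ships its own diff (two diffs, two sets of headers).
copy_p0 = list(original)
copy_p0[0:3] = [1, 1, 1]
copy_p1 = list(original)
copy_p1[2:4] = [2, 2]
as_diffs = [make_diff(original, copy_p0), make_diff(original, copy_p1)]

# HS-like case: both processors write the single copy held by
# their node, so one diff covers the changes of both.
node_copy = list(original)
node_copy[0:3] = [1, 1, 1]
node_copy[2:4] = [2, 2]
hs_diff = make_diff(original, node_copy)

as_words = sum(len(d) for d in as_diffs)
hs_words = len(hs_diff)
print(len(as_diffs), as_words, 1, hs_words)  # 2 5 1 4
```

In this example the per-node diff replaces two diffs with
one and drops the word that both processors modified.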
  MWater
Figure  presents speedups for M-Water running 
time steps on  molecules. Beyond  processors,
AH is the only architecture whose speedup improves.
AS obtains a peak speedup of  at  processors,
and HS reaches its peak speedup of  at 
processors. The performance is poor for the AS
architecture because of the large number of
synchronization operations as well as the large amount
of data communicated. The HS approach gets better
performance primarily because of the reduction in the
number of messages. Also, less data is sent because of
the coalescing of diffs and the reduction in the amount
of consistency data. Although HS gets a  fold decrease
in the overall number of messages and a  fold
decrease in the amount of data movement compared to
the AS architecture, its performance does not match
AH because the number of synchronization messages
and the wait time to acquire the locks remain high
(see Figure ).
  Reduced Software Overhead
By optimizing the software structure, as in
Peregrine , or by using a user-level hardware
interface, as in SHRIMP , lower software overheads can
be achieved. In this section, we examine the effect of
reducing both the fixed and per-word overheads.
Specifically, we examine the effect of reducing the
fixed cost from  processor cycles to  (as in Peregrine)
and  (as in SHRIMP), and the per-word cost from
 processor cycles to  (one bcopy to the interface).
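The overhead model examined here is a fixed cost per message
plus a cost per word. The sketch below evaluates it for the
four curve labels that appear in the overhead figures (50/8,
500/8, 500/28, 5000/28), read as fixed cycles / per-word
cycles. That reading of the labels, the function name
`send_overhead`, and the two message sizes are assumptions
made for illustration.

```python
def send_overhead(words, fixed_cycles, per_word_cycles):
    """Software cycles charged to one message of `words` words."""
    return fixed_cycles + per_word_cycles * words


# (fixed, per-word) pairs; an assumed reading of the figure legends.
configs = [(5000, 28), (500, 28), (500, 8), (50, 8)]

# A short message and a long one (hypothetical sizes, in words).
for words in (4, 1024):
    costs = [send_overhead(words, f, w) for f, w in configs]
    print(words, costs)
# 4 [5112, 612, 532, 82]
# 1024 [33672, 29172, 8692, 8242]
```

For the short message, lowering the fixed cost removes most
of the overhead; for the long message, the per-word cost
dominates.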
Figures  and  present the speedups for SOR
and M-Water on the AS architecture. These show
the smallest and the largest effects for reducing the
software overhead. For SOR, the fixed cost has the
largest effect on performance. The resulting speedup
approaches that of the other two architectures. For
M-Water, the fixed and per-word costs have equal
effects on performance.
Figure  presents the speedups for M-Water on
the HS architecture. Because HS reduces the amount
of data movement more than the number of messages
compared to AS, the fixed cost has a more significant
effect than it did for AS.
Figure  Varying software overheads for HS M-Water
 molecules and  steps
(speedup vs. processors, 0 to 64; curves: 50/8, 500/8,
500/28, 5000/28)
 Summary
We conclude that the AS approach does not scale
well for the applications and problem sizes that we
simulated unless the software overheads are
significantly reduced. For SOR and TSP, the HS
performance is almost identical to that of the AH
approach. For M-Water, the frequent synchronization
results in inferior performance for HS compared to AH.
Our results are, of course, limited by the applications
we simulated. Due to simulation time constraints, the
problem sizes are small. The effect of larger
applications remains to be investigated.
 Related Work
TreadMarks implements shared memory entirely in
software. Both data movement and memory coherence
are performed by software using the message-passing
and virtual memory management hardware. Previous
evaluations of such systems, for example Carter et
al. , have compared their performance to hand-coded
message passing.
Other related studies have examined software versus
hardware cache coherence. In these studies, the
hardware is responsible for performing the data
movement. Upon access, the hardware automatically loads
invalid cache lines from memory. To maintain
coherency, these schemes require the placement of cache
flush/invalidation instructions by the compiler or the
programmer at the end of critical sections. Cytron
et al.  and Cheong and Veidenbaum  describe
algorithms for compiler-based software cache coherence.
Owicki and Agarwal compare analytically the
performance of such a scheme to snoopy cache coherence
hardware . Petersen, on the other hand, describes
a software cache coherence scheme using the virtual
memory management hardware . This scheme is
transparent to the programmer. It does not require
the programmer or compiler to insert cache flush
instructions. Using trace-driven simulation, she
compared the performance of her software scheme on a
shared bus to snoopy cache hardware.
A few implementations using both hardware and
software have been proposed. Both Chaiken et al. 
and Hill et al.  describe shared memory
implementations that handle the most common cache
coherence operations in hardware and the most unusual
operations in software, thereby reducing the complexity
of the hardware without significantly impacting the
performance.
 Conclusions
In this paper, we have assessed the performance
tradeoffs between hardware and software
implementations of shared memory.
For small numbers of processors, we have compared
a bus-based shared memory multiprocessor, the SGI
D, to a network of workstations running a software
DSM system, specifically the TreadMarks DSM
system running on an ATM network of DECstation
s. An important aspect of this comparison is
the similarity between the two platforms in all aspects
(processor, cache, compiler, parallel programming
interface) except the shared memory implementation.
For the applications with moderate synchronization
and communication demands, the two configurations
perform comparably. When these demands increase,
the communication latency and the software overhead
of TreadMarks cause it to fall off in performance. For
applications with high memory bandwidth
requirements, the network of workstations can perform
better because it provides each processor with a private
path to memory.
We use simulation to extend our results to larger
numbers of processors. For the sizes of the
applications we considered, a straightforward extension
of the software DSM system scales poorly unless
software overheads reflect recent advances in
communication software and hardware. We investigated an
intermediate approach using a general-purpose
network and software DSM to interconnect hardware
bus-based multiprocessor nodes. Such a configuration can
be constructed with commodity parts, resulting in
cost and complexity gains over a hardware approach
that uses a dedicated interconnect and a
directory-based cache controller. For applications with
good locality and moderate synchronization rates, the
combined hardware-software approach results in
performance comparable to that obtained using a pure
hardware approach.
Acknowledgements
We thank Michael Scott at the University of
Rochester and Michael Zeitlin at Texaco for providing
us with access to their Silicon Graphics
multiprocessors. We would also like to thank Behnaam
Aazhang, John Bennett, Bob Bixby, Keith Cooper, and
John Mellor-Crummey for providing us with the extra
computing power that we needed to complete the
simulations.
References
S. V. Adve and M. D. Hill. A unified formalization of
four shared memory models. IEEE Transactions on
Parallel and Distributed Systems,  June .

B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The
Midway distributed shared memory system. In
Proceedings of the  CompCon Conference, pages ,
February .

M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki,
E. W. Felten, and J. Sandberg. Virtual memory mapped
network interface for the SHRIMP multicomputer.
Technical Report CS-TR- , Department of Computer
Science, Princeton University, November .

J. B. Carter, J. K. Bennett, and W. Zwaenepoel.
Implementation and performance of Munin. In
Proceedings of the th ACM Symposium on Operating
Systems Principles, pages , October .

D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS
directories: A scalable cache coherence scheme. In
Proceedings of the th Symposium on Architectural
Support for Programming Languages and Operating
Systems, pages , April .

H. Cheong and A. V. Veidenbaum. A cache coherence
scheme with fast selective invalidation. In
Proceedings of the th Annual International Symposium
on Computer Architecture, pages , June .

R. G. Covington, S. Dwarkadas, J. R. Jump, S. Madala,
and J. B. Sinclair. The efficient simulation of
parallel computer systems. International Journal in
Computer Simulation,  January .

R. Cytron, S. Karlovsky, and K. P. McAuliffe.
Automatic management of programmable caches. In 
International Conference on Parallel Processing,
pages , August .

S. Dwarkadas, A. A. Schaffer, R. W. Cottingham Jr.,
A. L. Cox, P. Keleher, and W. Zwaenepoel.
Parallelization of general linkage analysis problems.
To appear in Human Heredity, .

M. Galles and E. Williams. Performance optimizations,
implementation, and verification of the SGI Challenge
multiprocessor. Technical report, Silicon Graphics
Computer Systems, .

K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons,
A. Gupta, and J. Hennessy. Memory consistency and
event ordering in scalable shared memory
multiprocessors. In Proceedings of the th Annual
International Symposium on Computer Architecture,
pages , May .

M. D. Hill, J. R. Larus, S. K. Reinhardt, and
D. A. Wood. Cooperative shared memory: Software and
hardware support for scaleable multiprocessors. In
Proceedings of the th Symposium on Architectural
Support for Programming Languages and Operating
Systems, pages , October .

D. B. Johnson and W. Zwaenepoel. The Peregrine
high-performance RPC system. Software: Practice and
Experience,  February .

P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release
consistency for software distributed shared memory. In
Proceedings of the th Annual International Symposium
on Computer Architecture, pages , May .

P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel.
TreadMarks: Distributed shared memory on standard
workstations and operating systems. In Proceedings of
the  Winter Usenix Conference, pages , January .

J. Kuskin and D. Ofelt et al. The Stanford FLASH
multiprocessor. To appear in Proceedings of the st
Annual International Conference on Computer
Architecture, April .

D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber,
A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam.
The Stanford DASH multiprocessor. IEEE Computer,
 March .

K. Li and P. Hudak. Memory coherence in shared virtual
memory systems. ACM Transactions on Computer
Systems,  November .

E. L. Lusk and R. A. Overbeek et al. Portable Programs
for Parallel Processors. Holt, Rinehart and Winston,
Inc., .

S. Owicki and A. Agarwal. Evaluating the performance
of software cache coherence. In Proceedings of the rd
Symposium on Architectural Support for Programming
Languages and Operating Systems, pages , May .

M. Papamarcos and J. Patel. A low overhead coherence
solution for multiprocessors with private cache
memories. In Proceedings of the th Annual
International Symposium on Computer Architecture,
pages , May .

K. Petersen. Operating System Support for Modern
Memory Hierarchies. PhD thesis, Princeton University,
May .

J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH:
Stanford parallel applications for shared memory.
Technical Report CSL-TR- , Stanford University,
April .