A DECENTRALIZED ALGORITHM FOR COMMUNICATION
EFFICIENT DISTRIBUTED SHARED MEMORY
By
LEGAND L. BURGE III
Bachelor of Science
Langston University
Langston, Oklahoma
1992
Submitted to the Faculty of the
Graduate College of
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
July, 1995
OKLAHOMA STATE UNIVERSITY
A DECENTRALIZED ALGORITHM FOR COMMUNICATION
EFFICIENT DISTRIBUTED SHARED MEMORY
Thesis Approved:
Thesis Adviser
Dean of the Graduate College
ACKNOWLEDGMENTS
I sincerely thank my graduate adviser Dr. Mitchell L. Neilsen for the guidance,
help, and time he has given me for the completion of my thesis work. His perseverance
and hard work inspired me to venture into the advanced aspects of this work. I would
like to express my sincere thanks to Dr. George for his direction and leadership.
Without the encouragement and help he has given me, the completion of this work
would have been impossible. I also sincerely thank Dr. H. Lu for serving on my
committee. Her suggestions have helped me to improve the quality of this work.
My special thanks goes to Dr. In Hai Ro, from Langston University, for the
support that he has given me throughout my studies here at Oklahoma State
University. I would also like to thank Dr. Masaaki Mizuno, from Kansas State
University, for allowing me to utilize his simulator for my research.
My respectful thanks goes to my parents, Dr. Legand L. Burge Jr. and Mrs.
Gwenetta V. Burge, for all the love and support they have given me in my life. And,
last but certainly not least, I thank all other members of my family for the love,
encouragement, and confidence they have placed in me.
TABLE OF CONTENTS

Chapter

1. INTRODUCTION
   1.1 Thesis
   1.2 Organization

2. LITERATURE REVIEW
   2.1 Memory Consistency
   2.2 Existing Cache-consistency Protocols for DSM
   2.3 Definitions

3. PROBLEM STATEMENT
   3.1 Overview of Protocol
       3.1.1 Distributed Manager Implementations
       3.1.2 Data Structures
   3.2 Description of Protocol

4. PERFORMANCE ANALYSIS AND RESULTS
   4.1 DSM Simulation
       4.1.1 Performance Metrics
   4.2 Analysis
       4.2.1 Centralized Protocols
       4.2.2 Decentralized Protocol

5. CONCLUSION
   5.1 Summary
   5.2 Future Work

BIBLIOGRAPHY

APPENDIX A: RELATED PROOFS
   A.1 Proof of Program Correctness
   A.2 Proof of Bound on Forward Messages

APPENDIX B: COST/PERFORMANCE

APPENDIX C: SIMULATION PARAMETERS
LIST OF TABLES

Table

B.1 Memory Cost Performance
B.2 Communication Cost Performance
C.1 Simulation Parameters - Brown's Protocol
C.2 Simulation Parameters - Mizuno, Raynal, Singh, and Neilsen Protocol
C.3 Simulation Parameters - Decentralized Protocol
LIST OF FIGURES

Figure

1.1 MIMD Hierarchy
1.2 Tightly Coupled System
1.3 Loosely Coupled System
4.1 Access Efficiency vs the Probability of Reading (Centralized DSM)
4.2 Average Access Time vs the Probability of Reading (Centralized DSM)
4.3 Average Access Time vs Number of Processors (Centralized DSM)
4.4 Average Access Time vs Number of Objects (Centralized DSM)
4.5 Average Wait Time vs Number of Processors (Centralized DSM)
4.6 Average Wait Time vs Number of Objects (Centralized DSM)
4.7 Locality of Reference
4.8 Average Access Time vs Threshold
4.9 Average Forward Messages/Forward Request
4.10 Access Efficiency vs the Probability of Reading (Decentralized DSM)
4.11 Average Access Time vs Number of Processors (Decentralized DSM)
4.12 Average Wait Time vs Number of Processors (Decentralized DSM)
4.13 Average Access Time vs Number of Operations (Decentralized DSM)
4.14 Average Wait Time vs Number of Objects (Decentralized DSM)
4.15 Performance Comparison of DSM Algorithms 50/50
4.16 Performance Comparison of DSM Algorithms 80/20
4.17 Performance Comparison of DSM Algorithms 80/0
CHAPTER 1
INTRODUCTION
A sequential computer executes one CPU instruction at a time. Over the years,
sequential computers have increased steadily in performance, primarily as a result of
improvements in digital hardware technology. One major concern of computer
designers is that logic and memory devices are approaching ultimate physical limits on
their size and speed. While size reductions and speed increases of a few orders of
magnitude beyond present levels seem feasible, further improvements in the performance
of sequential computers may not be achievable at acceptable cost. A more economic
solution is to design systems that can process more than one CPU instruction at a
time. This is known as parallel processing. Parallel processors are also referred to
as distributed systems. These systems consist of an interconnected collection of
autonomous computers [Sta84]. There are many ways of classifying distributed systems
based on their structure or behavior.
Based on Flynn's taxonomy of computer architectures, distributed systems belong
to the MIMD (multiple instruction multiple data) class of computer architectures
[Tan92]. The MIMD class consists of two categories: those that have shared memory
(tightly coupled), and those that do not (loosely coupled).
As shown in Figure 1.1, each category can be further divided based on the
architecture of the interconnection network. In bus-based systems, there is a single
network, backplane, bus, cable, or other medium that connects all machines. Switched
systems connect machines by individual wires.
Figure 1.1 MIMD Hierarchy
Tightly coupled systems are also referred to as multiprocessors. In multiprocessor
systems, at least part of the primary memory is shared, as shown in Figure 1.2. A
system with this shared (global) primary memory organization provides a convenient
message depository for fast processor-to-processor communication.
A shared memory can, however, be a major bottleneck, particularly when the
processors must share large amounts of information, since normally only one processor
can access a given memory module at a time [Hay88]. Tightly coupled systems tend
to be used more as parallel systems (working on a single problem).
Figure 1.2 Tightly Coupled System
Loosely coupled systems are also referred to as multicomputers. In multicomputer
systems, processors only have access to their own local memories, and processors
communicate through message passing as shown in the system of Figure 1.3.
Loosely coupled systems are easy to build, with the disadvantage of more complex
software. Software designed to run on distributed systems gives them a high degree of
cohesiveness and transparency. Loosely coupled systems tend to be used for working
on many unrelated problems.
Figure 1.3 Loosely Coupled System
Initially, researchers strictly followed the common parallel programming paradigms:
shared variables (for tightly coupled systems) and message passing (for loosely coupled
systems). More recently, efforts to combine the advantages of multiprocessors
(easy to program) and multicomputers (easy to build) have led to communication
paradigms that simulate shared memory on multicomputer systems [SZ90]. These
paradigms allow multicomputers to communicate through Distributed Shared Memory
(DSM). Distributed Shared Memory is an attractive abstraction because it provides
processes with uniform access to local and remote information. This uniformity of
access simplifies programming, eliminating the need for separate mechanisms to access
local state and remote state information. Several techniques have been proposed to
allow multicomputers to communicate through Distributed Shared Memory (DSM)
[BT91, LH89, MSRN93, MSZ93, SZ90, FL92, AHJ91, GLL+90]. Each technique provides
its own level of coherence. An important class of DSM implementations is one
which uses cache memories to improve efficiency. Brown, Afek, and Merritt proposed
cache-consistency protocols that provide a lower level of coherence. Such protocols
are useful for applications that do not require strict consistency among all sites in
a distributed system [ABM89, Bro90]. Mizuno, Zhou, Singh, and Neilsen proposed
more efficient algorithms which enforce the same level of coherence as Brown's protocol
[MSRN93, MSZ93]. These protocols use the abstraction of a single copy of shared
memory to enforce sequential consistency. This provides the advantage of a simple
implementation and a clean correctness proof. However, a single copy of shared memory
could become a bottleneck. Typically, if remote accesses to shared memory are
costly, this would also decrease performance and object availability.
1.1 Thesis
In this thesis, we present a decentralized cache-consistency protocol for DSM which
provides the same level of coherence as the protocols presented in [ABM89, Bro90,
MSRN93, MSZ93]. Our protocol distributes the shared objects among all processors
in the network, providing an increase in performance and object availability. Our
protocol is not dependent on the system architecture, therefore allowing the algorithm
to scale to a large number of processors more efficiently than the protocols in [ABM89,
Bro90, MSRN93, MSZ93]. As memory cost decreases and the cost of communication
becomes more expensive, we show that the increase in memory performance/cost of
our protocol is minimal as compared to the reduction in communication cost. We
prove that our protocol satisfies a formulation of sequential consistency. Next, we
provide an in-depth comparison/analysis of our protocol and the previously proposed
protocols. Lastly, we show performance metrics of each protocol and explain which
protocol performs better or worse in various situations.
In summary, the purpose of this thesis is three-fold:
1. To present a decentralized cache-consistency protocol for DSM.
2. To prove that the protocol enforces sequential consistency.
3. To provide a comparison of our protocol with proposed protocols.
1.2 Organization
The thesis is divided into the following chapters:
• Chapter 2: A literature review of cache-consistency protocols for DSM is pre-
sented.
• Chapter 3: A discussion of the decentralized cache-consistency protocol is pre-
sented.
• Chapter 4: A comparison of cache-consistency protocols for DSM is presented.
• Chapter 5: A summary of the thesis and suggestions for future work are pre-
sented.
• Appendix A: Related proofs are included.
• Appendix B: Memory and communication costs of the existing protocols are
presented.
• Appendix C: DSM simulation parameters are presented.
CHAPTER 2
LITERATURE REVIEW
The extent to which all processors can be kept busy depends on the computer
architecture, the tasks being performed, and the manner in which the tasks have been
programmed. A major concern in designing and programming efficient parallel
applications is in avoiding conflicts in the use of shared resources, e.g., memory. In order
to maintain an appropriate performance level, multiple copies of shared data are
often maintained. In most distributed applications, all updates are performed on a
primary copy and all reads are performed on a local copy that is cached. The value of
a primary copy is replicated to remote cached copies once an update occurs. Replication
introduces the problem of having inconsistent copies of the same logical data.
Complications also arise because the operations on shared data may not be instantaneous.
A memory consistency model defines certain restrictions on the use of shared
memory. Applications that adhere to these restrictions are given guarantees about
the coherence of that memory. Several notions of consistency have been proposed in
the literature to implement DSM [HW90, Lam79, AHJ91, GLL+90, FL92].
2.1 Memory Consistency
Herlihy and Wing proposed the idea of linearizability, which is a correctness condition
for concurrent objects that allows strict consistency, providing a high level of
coherence. Linearizability provides the illusion that each operation applied by concurrent
processes takes effect instantaneously. Linearizability is more appropriate for
applications such as multiprocessor operating systems in which concurrency is of primary
interest.
A correctness condition which provides a less restricted form of consistency than
linearizability is sequential consistency [Lam79].

2.2 Existing Cache-consistency Protocols for DSM

Afek, Brown, and Merritt Algorithm

The lazy cache algorithm of Afek, Brown, and Merritt [ABM89] incurs less latency
as compared to [Bro90]. As shown in Appendix B, this algorithm also requires an
expensive atomic broadcast/multicast.
Mizuno, Singh, Raynal, and Neilsen Algorithm
Mizuno, Singh, Raynal, and Neilsen proposed a memory consistency protocol in
[MSRN93, MSZ93] that allows the same set of sequentially consistent executions
as the protocol in [Bro90]. This protocol maintains additional information in shared
memory in order to reduce the amount of communication (i.e., no multicasting). The
architecture organization consists of a shared memory module (SMem), residing at
a network processor, and multiple processors. SMem keeps track of the most recent
write operation on each object as well as the values in the local cache of each processor.
This is done by maintaining state information and capturing causal relations among
read/write operations at SMem. All updates are performed on local cached copies
and also at SMem. All reads are performed locally if the object is present; otherwise
the object is read from SMem. After each access to SMem, a process is notified of any
out-of-date values through an acknowledgment. As shown in Appendix B, this memory
consistency protocol uses the communication versus memory and computation
trade-off to achieve efficient performance.
2.3 Definitions
As stated in the previous section, most DSM implementations are based on cache-consistency
protocols which use different variations of the notion of sequential consistency.
In this section, we review definitions of consistency on which our implementation
is based. Some of the definitions and notations introduced in this section follow
[MSRN93, MSZ93]. A shared memory system consists of a set of processors P and
a memory M. Each processor in P may execute a sequence of read and write operations
on objects in M. A write operation by processor i on an object x is denoted
by $w_i(x)v$, where v is the value written to x by this operation. A read operation on
x by i is denoted by $r_i(x)u$, where u is the value of x returned by this operation. For
simplicity, we assume that all values written are distinct.
An execution history of a shared memory system is a poset $\hat{U} = (U, \rightarrow_U)$, where
U is a set of read and write operations and $\rightarrow_U$ is an irreflexive and antisymmetric
relation on U; that is, $\rightarrow_U$ is a partial order on U. In the following we give some
definitions:

• We say that an execution history $\hat{U} = (U, \rightarrow_U)$ is processor-ordered if the
operations of each processor in U are totally ordered by $\rightarrow_U$.

• An execution history $\hat{S} = (S, \rightarrow_S)$ is a sequential history if it is processor-ordered
and $\rightarrow_S$ is a total order.

• A sequential history $\hat{S} = (S, \rightarrow_S)$ is legal if for every read operation $r(x)v$ in S,
there exists a write operation $w(x)v$ such that $w(x)v \rightarrow_S r(x)v$ and there does
not exist a write operation $w(x)u$ such that $w(x)v \rightarrow_S w(x)u \rightarrow_S r(x)v$.

• A restriction of $\hat{V} = (V, \rightarrow_V)$ to the set U, where $U \subseteq V$, is an execution history
$\hat{U} = (U, \rightarrow_U)$ such that for any operations o and o' in U, $o \rightarrow_U o'$ iff $o \rightarrow_V o'$.

• We define $\hat{U} \mid i$ to be the restriction of history $\hat{U}$ to the set of operations
performed by i.

• Two execution histories $\hat{S}$ and $\hat{U}$ are equivalent if for every processor i,
$\hat{S} \mid i = \hat{U} \mid i$.

• Two execution histories $\hat{S} = (S, \rightarrow_S)$ and $\hat{U} = (U, \rightarrow_U)$ are result-equivalent
if $S = U$; that is, corresponding read operations return the same value and
corresponding write operations write the same value in both $\hat{S}$ and $\hat{U}$. For
example, $w_1(x)1, w_2(x)2, r_2(x)1$ and $w_1(x)1, r_2(x)1, w_2(x)2$ are result-equivalent
but not equivalent.

• $\hat{U} = (U, \rightarrow_U)$ respects $\hat{V} = (V, \rightarrow_V)$ if $V \subseteq U$ and for any two operations o and
o' in V, if $o \rightarrow_V o'$ then $o \rightarrow_U o'$.
Definition 1: A memory M is consistent if for each of its execution histories $\hat{H}$,
there exists a legal sequential execution history $\hat{WR} = (WR, \rightarrow_{WR})$, where $WR$ is
the set of all read and write operations in $\hat{H}$, such that $\hat{H}$ and $\hat{WR}$ are equivalent.

Definition 2: A memory M is consistent if for each of its execution histories $\hat{H}$,
there exists a legal sequential history $\hat{W} = (W, \rightarrow_W)$, where W is the set of all write
operations in $\hat{H}$, such that the following property holds for each processor i:

(a) Let $WR_i = W \cup R_i$, where $R_i$ is the set of read operations performed
by processor i in $\hat{H}$. Then, there exists a legal sequential history $\hat{WR}_i =
(WR_i, \rightarrow_{WR_i})$ such that $\hat{WR}_i$ respects $\hat{W}$ and $\hat{H} \mid i = \hat{WR}_i \mid i$.

It has been shown in [MSRN93] that Definitions 1 and 2 are equivalent. Definition 1
considers a sequential history for the entire system. This consists of the
read and write operations issued by all processors. Definition 2 considers a sequential
history for each processor i. This consists of the write operations issued by all the
processors and the read operations issued only by processor i. We will use the definitions
in Appendix A to prove that our protocol satisfies a formulation of sequential
consistency.
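To make Definition 1 concrete, consider the following small history; this illustration is ours and does not appear in [MSRN93, MSZ93]. Suppose processor 1 issues $w_1(x)1$ followed by $r_1(x)2$, processor 2 issues only $w_2(x)2$, and $\rightarrow_H$ orders only operations of the same processor:

    % One legal sequential witness with the same per-processor order:
    \hat{WR}:\quad w_1(x)1 \;\rightarrow_{WR}\; w_2(x)2 \;\rightarrow_{WR}\; r_1(x)2
    % Legal: r_1(x)2 returns the value of the latest preceding write on x,
    % namely w_2(x)2, with no write in between.

Since $\hat{WR}$ is sequential, legal, and equivalent to the original history (each processor's operations keep their relative order), the memory may return 2 for this read and still be consistent by Definition 1.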
CHAPTER 3
PROBLEM STATEMENT
The previously proposed protocols in [Bro90, ABM89, MSRN93, MSZ93] each
require the abstraction of a single processor centralized memory to enforce the real-time
ordering on writes. While this simplifies the implementation and provides a clean
correctness argument, in reality this strategy would perform poorly, particularly as
the number of processors/objects increases and each processor accesses shared memory
more frequently. In order to maintain efficient performance, [Bro90, ABM89,
MSRN93, MSZ93] assume the architecture consists of a set of processors connected
by a shared bus. In this thesis, we consider a larger scale system architecture in
which computers are logically fully-connected and communicate over costly point-to-point
links. Due to the cost of remote accesses, a single processor centralized
memory strategy would become a bottleneck, decreasing the performance and object
availability.
In this chapter, we present a decentralized cache-consistency protocol for DSM
which manages objects distributed among all processors in the system. This provides
an increase in access performance due to the locality of reference. It also allows
the algorithm to scale to a large number of processors/objects more efficiently than
the previous protocols, by avoiding the bottleneck of a single processor centralized
memory. Our protocol preserves the real-time ordering on write operations, and allows
the same set of sequentially consistent executions as [Bro90, MSRN93, MSZ93]
without requiring atomic broadcast/multicast. As memory cost decreases and the
cost of communication becomes more expensive, we show that the increase in memory
performance/cost of our protocol is minimal as compared to the reduction in communication
cost. In the following sections we give an overview of the protocol, followed
by a description of two implementations of the protocol. Finally, we show the
performance of the protocol in terms of memory and communication cost.
3.1 Overview of Protocol
We assume that the system consists of logically fully-connected autonomous computers
communicating across point-to-point links. Each processor contains two threads of
control, a Processor Manager and an Object Manager, which share a single address
space. This address space contains state information to capture causal relations of
read/write operations to an object, and to notify a processor of invalid objects. Each
processor initially owns a set of objects, and no two processors own the same object.
The owner of an object holds the most up-to-date version of that object, since updates
to an object may only be processed by the owner of that object. This allows
the real-time ordering on write operations to an object to be preserved, but only with
respect to the owner of the object. Therefore, all other processors only maintain local
cache copies. Each processor manager communicates with the owner of an object for
a read/write request if:
1. During a read to an object not currently owned, the value in the cache is invalid.
2. A write operation is issued to an object not owned by the current process.
Otherwise, the read/write operation is performed locally.
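The two rules above can be condensed into a small decision sketch. The following minimal C illustration assumes hypothetical helpers self_id, owner_of(), and cache_valid() standing in for the protocol state described in Section 3.1.2; it is our sketch, not the thesis pseudocode.

    #include <stdbool.h>

    /* Hypothetical stand-ins for per-processor protocol state. */
    static int  self_id = 0;
    static int  owner_of(int x)    { return x % 100; }  /* e.g., a fixed mapping */
    static bool cache_valid(int x) { (void)x; return false; }

    /* Rule 1: a read contacts the owner only if this processor does not
     * own x and the cached copy is invalid. */
    static bool remote_read_needed(int x)
    {
        return owner_of(x) != self_id && !cache_valid(x);
    }

    /* Rule 2: a write always contacts the owner unless this processor owns x. */
    static bool remote_write_needed(int x)
    {
        return owner_of(x) != self_id;
    }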
3.1.1 Distributed Manager Implementations
In the next section, we describe two management schemes used to keep track of the
owner of an object. A primary problem with distributed manager schemes is the
initial distribution of objects. As shown in Figure 4.7, an optimal solution would be
to distribute an object to the processor that accesses the object most frequently.
Fixed Distributed Manager
The fixed distributed manager scheme distributes the central manager's (SMem) role
to every processor in the system, thereby avoiding a single processor bottleneck
situation. In this scheme, every processor keeps track of the owners of a predetermined set
of objects (determined by a mapping function H) [LH89]. The primary difficulty in
such a scheme is choosing an appropriate mapping from objects to processors. If we
assume there are M objects in the system and I = {1, ..., M}, then H is defined as a
hashing function such that

    H(p) = p mod N

where p ∈ I and N is the number of processors. Therefore, when processor i requests
access to an object p, processor i contacts the object manager H(p), and the protocol
proceeds as in the centralized protocol in [MSRN93, MSZ93].
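As a minimal sketch, the mapping H can be written down directly in C; the integer object identifiers and zero-based processor ids are our illustrative choices, not details fixed by the thesis.

    /* Fixed distributed manager: ownership records for object p are kept
     * at processor H(p). N_PROCS matches the simulation default of 100. */
    enum { N_PROCS = 100 };

    static int H(int p)
    {
        return p % N_PROCS;
    }

A processor requesting object p therefore always contacts processor H(p), which keeps track of p's owner.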
Dynamic Distributed Manager
In the dynamic distributed manager scheme, every processor keeps track of the ownership
of an object in its local cache. This is maintained through the use of the
vector Probowner [LH89]. The value Probowner[o] contains the owner of object o.
As processors that frequently access an object can cause the object to migrate, this
value can be either the true owner or the probable owner of an object. This value is
used as a hint to locate the true owner of an object.
When a processor wants to perform a remote operation on some object o, it sends
a request to the processor i indicated by its Probowner[o] field. Upon receipt of the
request, if processor i is the true owner of the object, the algorithm proceeds as in the
centralized protocol described in [MSRN93, MSZ93]. Otherwise, processor i forwards
the request to the processor indicated in its Probowner[o] field. This continues until
the true owner of the object is found. The hint in Probowner[o] is updated after
every remote operation to object o. In Appendix A, we show that the implementation
of the dynamic distributed manager algorithm requires at most (N - 1) forwarding
request messages to locate the owner of an object in a system containing N processors.
In the optimal case, only one extra message is required to forward a request, assuming
the hint of the probable owner is correct. Because the hints are updated as a side
effect of different migration policies, the average number of messages required should
be much less.
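The following C sketch simulates the forwarding chain locally to illustrate the bound; it is our illustration under the assumption that the hints for one object are gathered into a single array, whereas in the protocol each hop is a forwarded message between processors.

    #include <assert.h>
    #include <stdio.h>

    enum { N = 5 };

    /* probowner[i] is processor i's hint for one fixed object; the true
     * owner is the processor whose hint names itself. */
    static int probowner[N] = { 1, 2, 3, 3, 3 };

    /* Follow hints from processor `start` until the true owner is found,
     * counting forwarded requests. */
    static int locate_owner(int start, int *forwards)
    {
        int i = start;
        *forwards = 0;
        while (probowner[i] != i) {
            i = probowner[i];              /* one forwarding request message */
            (*forwards)++;
            assert(*forwards <= N - 1);    /* bound proved in Appendix A     */
        }
        return i;
    }

    int main(void)
    {
        int forwards;
        int owner = locate_owner(0, &forwards);
        printf("owner = %d after %d forwards\n", owner, forwards);
        return 0;
    }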
Migration Policies
Our dynamic distributed manager scheme allows objects to migrate between processors.
This introduces the notion of a migration policy, which could upgrade or
degrade the performance of the protocol due to the locality of reference. There are
two policies that can be used: the Random Policy and the Threshold Policy. Our thesis is
only concerned with the threshold policy. We consider migration on read, write, and
read/write accesses.

• Random Policy - The random policy is a simple migration scheme that uses no
state information. An object o is simply migrated to processor i after processor i
requests a remote operation on object o. The problem with this approach is that
useless object migration can occur when an object is migrated to a processor
that doesn't access it frequently.

• Threshold Policy - The problem of useless object migration under the random
policy can be avoided by maintaining statistical information about how frequently
each processor accesses an object. Based on locality of reference, this strategy
chooses the best processor to engage in migration of an object.

This is very costly in terms of memory. Each processor must maintain a threshold
vector T of size N x M, where N denotes the number of processors and M
denotes the number of objects. Moreover, T[p, o] contains the expected number
of accesses by processor p on object o.
3.1.2 Data Structures
Each processor manages the following data structures. Let N denote the number of
processors and M denote the number of objects.

Fixed Distributed Manager Scheme

1. Memory area C[M]. C_i[M] contains the values cached at processor i.

2. One-dimensional array Causal[M], used to capture causal relations among write
operations. Causal_i[o] keeps the version number of the most recent write on
object o at processor i.

3. A set of valid cache objects valid. The set valid_i is initialized to the objects
owned by processor i.

Dynamic Distributed Manager Scheme

1. Same as the Fixed Manager Scheme.

2. One-dimensional array Probowner[M]. Entry Probowner_i[o] contains processor
i's hint of the owner of object o.
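Collected into one place, the per-processor state might look as follows in C. This is a minimal sketch assuming fixed N and M and integer-valued objects; the field names follow the data structures listed above, and T is the threshold matrix from the migration policy of Section 3.1.1.

    enum { N = 100, M = 100 };   /* processors, objects */

    typedef int value_t;         /* assumed object value type */

    struct processor_state {
        value_t C[M];            /* cached object values                   */
        int     Causal[M];       /* version of most recent known write     */
        int     valid[M];        /* membership flags for the valid set     */
        int     Probowner[M];    /* hint: probable owner of each object    */
        int     T[N][M];         /* T[j][x]: accesses to x by processor j  */
    };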
3.2 Description of Protocol
In this section, we provide the actual description of the protocol using a syntax similar
to the C programming language. We denote all elements in a one-dimensional array R
by R[*]. Note that all operations on an object, local or remote, are executed atomically.
Therefore, simultaneous updates to local memory by the processor manager or the
object manager are synchronized.
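The required atomicity could be realized, for instance, with one lock per processor guarding its address space; this is a minimal sketch assuming POSIX threads, a mechanism the thesis does not prescribe.

    #include <pthread.h>

    /* One lock serializes the Processor Manager and Object Manager threads
     * of a single processor on their shared state. */
    static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

    static void with_state(void (*operation)(void))
    {
        pthread_mutex_lock(&state_lock);    /* begin atomic operation        */
        operation();                        /* a read, write, or msg handler */
        pthread_mutex_unlock(&state_lock);  /* end atomic operation          */
    }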
Variable Definitions
In this section, we define and motivate all variables used in our description of the
protocol.

1. Let x be an integer value denoting the object to access in the cache.

2. Let v represent any data structure or block of data structures to be stored in
shared memory.

3. Let valid represent a set of integer values to denote the valid objects stored in
the local cache.

4. Let Causal be an integer vector used to capture causal relations among write
operations to a shared object.

5. Let C be a vector of the type v to represent the shared objects maintained in
the cache.

6. Let i and j be integer values to denote the processor id.

7. Let Probowner be an integer vector used to denote the owner of an object.
Fixed Distributed Manager

In this section we provide a description of the decentralized protocol using the fixed
distributed manager scheme.

OBJECT MANAGER at processor_i:

Process [write, j, x, v, Causal_j[*]] message from processor_j ::
    C_i[x] = v;
    increment(Causal_i[x]);
    Invalidate(Invalid_i);
    valid_i = (valid_i - Invalid_i) ∪ {x};
    send [Causal_i[*]] message to processor_j

Process [read, j, x, Causal_j[*]] message from processor_j ::
    Invalidate(Invalid_i);
    valid_i = (valid_i - Invalid_i) ∪ {x};
    send [C_i[x], Causal_i[*]] message to processor_j

Procedure Invalidate(var Invalid) ::
    Invalid = ∅;
    for each y ∈ M, y ≠ x do
        if (Causal_i[y] < Causal_j[y]) then
            Causal_i[y] = 0;
            Invalid = Invalid ∪ {y}
        endif
    enddo

PROCESS MANAGER at processor_i:

write(x, v) ::
    if (H(x) ≠ i) then
        send [write, i, x, v, Causal_i[*]] message to processor_H(x)
        receive [Causal_j[*]] message from processor_H(x)
        Invalidate(Invalid_i);
        Causal_i[x] = Causal_j[x];
        valid_i = valid_i - Invalid_i;
    else
        increment(Causal_i[x]);
    endif
    valid_i = valid_i ∪ {x};
    C_i[x] = v;

read(x) ::
    if x ∉ valid_i then
        send [read, i, x, Causal_i[*]] message to processor_H(x)
        receive [v, Causal_j[*]] message from processor_H(x)
        Invalidate(Invalid_i);
        Causal_i[x] = Causal_j[x];
        valid_i = (valid_i - Invalid_i) ∪ {x};
        C_i[x] = v;
    endif
    return(C_i[x]);
Dynamic Distributed Manager

In this section we provide a description of the decentralized protocol using the dynamic
distributed manager scheme.

OBJECT MANAGER at processor_i:

Process [write/forward_w, j, x, v, Causal_j[*]] message from processor_j ::
    if (Probowner_i[x] == i) then
        checkthreshold(x, j);
        C_i[x] = v;
        increment(Causal_i[x]);
        Invalidate(Invalid_i);
        valid_i = (valid_i - Invalid_i) ∪ {x};
        send [Causal_i[*], Probowner_i[x]] message to processor_j
    else
        send [forward_w, j, x, v, Causal_j[*]] message to processor_Probowner_i[x]

Process [read/forward_r, j, x, Causal_j[*]] message from processor_j ::
    if (Probowner_i[x] == i) then
        checkthreshold(x, j);
        Invalidate(Invalid_i);
        valid_i = (valid_i - Invalid_i) ∪ {x};
        send [C_i[x], Causal_i[*], Probowner_i[x]] message to processor_j
    else
        send [forward_r, j, x, Causal_j[*]] message to processor_Probowner_i[x]

Procedure Invalidate(var Invalid) ::
    Invalid = ∅;
    for each y ∈ M, y ≠ x do
        if (Causal_i[y] < Causal_j[y]) then
            Causal_i[y] = 0;
            Invalid = Invalid ∪ {y};
        endif
    enddo

Procedure checkthreshold(x, j) ::
    increment(T_i[j, x]);
    if T_i[j, x] > t then
        Probowner_i[x] = j;
    endif

Procedure resetthreshold(x) ::
    if (Probowner_i[x] == i) then
        for each j ∈ N do
            T_i[j, x] = 0;
        enddo
    endif

PROCESS MANAGER at processor_i:

write(x, v) ::
    if (Probowner_i[x] ≠ i) then
        send [write, i, x, v, Causal_i[*]] message to processor_Probowner_i[x]
        receive [Causal_j[*], owner] message from processor_Probowner_i[x]
        Probowner_i[x] = owner;
        Invalidate(Invalid_i);
        resetthreshold(x);
        Causal_i[x] = Causal_j[x];
        valid_i = valid_i - Invalid_i;
    else
        increment(Causal_i[x]);
    endif
    valid_i = valid_i ∪ {x};
    C_i[x] = v;

read(x) ::
    if x ∉ valid_i then
        send [read, i, x, Causal_i[*]] message to processor_Probowner_i[x]
        receive [v, Causal_j[*], owner] message from processor_Probowner_i[x]
        Probowner_i[x] = owner;
        Invalidate(Invalid_i);
        resetthreshold(x);
        Causal_i[x] = Causal_j[x];
        valid_i = (valid_i - Invalid_i) ∪ {x};
        C_i[x] = v;
    endif
    return(C_i[x]);
Cost Performance
Our decentralized protocol requires one round of message exchange for a write operation
if the current process is not the owner of the object; otherwise the value is written
to the local cache. A read operation requires one round of message exchange if the
value in the local cache is not valid and the current process is not the owner of the
object; otherwise the value is read from the local cache. We provide a protocol similar to
[MSRN93, MSZ93] that does not require an atomic broadcast capability, and utilizes
fewer message rounds due to the locality of reference (refer to Appendix B).
In particular, if we consider performing all read operations, our protocol provides
about the same level of performance as the protocols in [MSRN93, MSZ93]. On
average, however, our protocol requires slightly fewer messages than the single shared
memory protocols in [MSRN93, MSZ93]. This is because remote reads to objects
owned by a process can be performed locally, whereas they must always be performed
remotely using the protocols in [MSRN93, MSZ93].
CHAPTER 4
PERFORMANCE ANALYSIS AND RESULTS
In this chapter, we simulate our protocol and the protocols presented in [Bro90,
MSRN93, MSZ93]. We analyze the behavior of each protocol under various conditions,
and show which protocol behaves better or worse using different metrics. Our
simulation consists of a process scheduler, which schedules discrete events involving
multiple processes. Processes created can communicate by using send() and
receive() functions. The simulation provides several process synchronization techniques
such as: send/receive, signal/wait, and release/acquire. We run each simulation
using a 486DX4 100-MHz computer running the Linux operating system. We assume
the simulation parameters given in Appendix C, and the maximum duration of any
protocol to be 10,000,000 simulation ticks.
4.1 DSM Simulation
4.1.1 Performance Metrics
For each protocol we assume the simulation parameters given in Appendix C. We
model the performance of each protocol and provide various metrics such as: local
access efficiency, average time for an operation, average wait time for an operation,
average number of forward messages per forward request, and the comparison of
performance between two algorithms.
Given the total access time $t_a$ and the percentage of local access operations
$p_{local}$, we define an access time ratio $r = t_{a,mem}/t_{a,local}$, where $t_{a,local}$ is the Local
Read/Write time and $t_{a,mem}$ is the Remote Read/Write time. We calculate the Remote
Read/Write time as the time to send the request to shared memory, process the
request, and receive the result (refer to Appendix C for actual times).
Because Brown's algorithm requires different access times for remote reading and
writing, the access time at shared memory $t_{a,mem}$ is defined as follows:

    $t_{a,mem} = RemoteReadTime \cdot p + WriteInvalidationTime \cdot (1 - p)$

where p is the probability of performing a read operation. For all other algorithms,
the remote read/write access times are equal (refer to Appendix C). We define the
average access time $t_{a,ave}$ as:

    $t_{a,ave} = p_{local} \cdot t_{a,local} + (1 - p_{local}) \cdot t_{a,mem}$    (Eq. 1)

We define the local access efficiency $e = t_{a,local}/t_{a,ave}$ to be the ratio of local access
time to the average access time [Hay88]. This determines the factor by which $t_{a,ave}$
differs from its minimum possible value $t_{a,local}$. From Eq. 1 and $r = t_{a,mem}/t_{a,local}$, we
obtain

    $e = 1/(r + (1 - r) \cdot p_{local})$    (Eq. 2)

The wait time per operation is defined as the time spent waiting for shared memory
to perform the request. Due to contention of processors for shared memory, this could
vary among the different algorithms. We relate the performance of two algorithms,
say X and Y, by showing what percentage faster X is than Y. This is denoted as
follows:

    $P_{faster} = ((ExecTime_Y - ExecTime_X)/ExecTime_X) \cdot 100$
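The following C sketch works through these metrics with illustrative numbers: the local time of 1 tick is from Appendix C, the 202-tick remote round is the figure quoted in Section 4.2, and the two execution times in the P_faster example are invented for illustration.

    #include <stdio.h>

    int main(void)
    {
        double t_local = 1.0;     /* local read/write, Appendix C          */
        double t_mem   = 202.0;   /* remote round trip quoted in Sec. 4.2  */
        double p_local = 0.9;     /* assumed fraction of local accesses    */

        double r     = t_mem / t_local;                   /* access time ratio */
        double t_ave = p_local * t_local
                     + (1.0 - p_local) * t_mem;           /* Eq. 1 */
        double e     = 1.0 / (r + (1.0 - r) * p_local);   /* Eq. 2 */

        /* e also equals t_local / t_ave; both give about 0.047 here. */
        printf("t_ave = %.1f ticks, e = %.3f\n", t_ave, e);

        /* P_faster: percentage by which X outperforms Y (invented times). */
        double exec_x = 80.0, exec_y = 100.0;
        printf("X is %.0f%% faster than Y\n",
               (exec_y - exec_x) / exec_x * 100.0);
        return 0;
    }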
4.2 Analysis
In the next sections, we analyze the behavior of each protocol under various conditions,
and show which protocol behaves better or worse and by what metrics. Except
where designated in the following figures, we assume:

• the total number of processors is 100;

• the total number of objects is 100;

• the total number of operations is 1000 per processor;

• the probability of performing a read/write operation is equally likely (i.e., 0.5);

• objects are selected from a uniform distribution.
4.2.1 Centralized Protocols
In this section we show the performance of the algorithms proposed in [Bro90, MSRN93,
MSZ93].
As defined in Eq. 2, e is calculated as a function of $p_{local}$. Figure 4.1 shows that it
is important to achieve high values of $p_{local}$ (between 0.9 and 1.0) in order to make
$e \approx 1$ (i.e., $t_{a,ave} \approx t_{a,local}$). Because all writes must be performed at shared memory,
the local access efficiency directly depends on the probability of reading an object,
which can be performed locally or remotely. If the probability of reading is very high, the
local access efficiency increases while the access time decreases. Based on Figure 4.2,
these algorithms will probably perform better, in terms of object accessibility, if
the application using these algorithms consisted of more reads than writes.
Figure 4.1 Access Efficiency vs the Probability of Reading
Figure 4.2 Average Access Time vs the Probability of Reading
Figures 4.3 and 4.4 show the average time of an operation as the number of processors
scales from 50 to 1000 and the total objects scale from 50 to 400, assuming a
point-to-point architecture. If we consider the parameters in Appendix C, such as
LATENCY, Local Read, and Remote Read/Write Time, and the number of processors
in Figure 4.3, the access time for Brown's algorithm can range from 1 to 105105
clocks, and from 1 to 202 clocks for the centralized protocol. As predicted, Brown's
algorithm performs much worse. This is due to the expensive multicast implemented as a
set of point-to-point messages, which increases the write access time linearly with the
number of processors. As the number of objects increases, both algorithms maintain
a constant average access time.
Figure 4.3 Average Access Time vs Number of Processors
Figure 4.4 Average Access Time vs Number of Objects
Both the algorithms presented in [Bro90, MSRN93, MSZ93] perform all writes and
reads to/from the single processor centralized memory in a single atomic operation.
This creates a bottleneck, typically when several processors are contending to access
the centralized shared memory. This results in a wait time at the centralized memory
until the request can be handled. In Figures 4.5 and 4.6, we show the average wait
time as the number of processors scales from 50 to 1000, and the number of objects
scales from 50 to 400. We will show that our decentralized protocol scales the number
of processors much better by distributing objects uniformly to all processors, which
in turn decreases processor contention as well as access time.
Figure 4.5 Average Wait Time vs Number of Processors
Figure 4.6 Average Wait Time vs Number of Objects
4.2.2 Decentralized Protocol
In this section we show the performance of our protocol as compared to the centralized
protocol in [MSRN93, MSZ93]. We examine both the fixed manager and
dynamic manager schemes to implement our protocol. If we consider the parameters
in Appendix C, such as LATENCY, Local Read, and Remote Read/Write Time, then
the access time for our decentralized protocol ranges from 1 to 202 ticks.
In certain situations the decentralized algorithm could provide an optimal level of
performance. Figure 4.7 shows that if a processor accesses the objects it owns more
frequently, the average access time per operation decreases significantly, as the local
access efficiency increases. This is due to the locality of references to processor-owned
data objects.
Figure 4.7 Locality of Reference
Although migration in the dynamic manager scheme can be performed on read, write,
and read/write accesses, Figure 4.8 shows that read/write migration yields a lower
average access time for low values of the threshold t. As the threshold increases, the
average access time for all migration schemes approaches 83.0.
Figure 4.8 Average Access Time vs Threshold
As proved in Appendix A, the worst case number of forward requests using the dynamic
manager scheme is N - 1, where N is the total number of processors. Because
the hints in the Probowner field are updated as a side effect of different migration
policies, overall the average number of messages required should be much less, and
is dependent on the threshold level t and the probability of performing a remote
operation (1 - $p_{local}$). Figure 4.9 shows the average forward messages per forward
request.
00 "5 Rea:d:::Onl y : .60:5:·:W·~I·i:8:;::·~:~:::I~~'::: '.
~.().i.R~~c":·/:W·tl:'.t·.;:·: ::
Figure 4.9 Average Forward Messages/Forward Request
For the remaining analysis, we assume a 50% chance of selecting an owned object
using both the fixed and dynamic manager schemes. We also assume read/write
migration for the dynamic manager scheme. As shown in Figures 4.10 and 4.11, the
decentralized algorithm performs significantly better. In particular, the local access
efficiency e approaches 1 slightly faster than in the centralized algorithm (Figure 4.10).
This is due to the locality of reference, as reads and writes can both be performed
locally if the object is owned by the current processor.
Therefore, our algorithm performs better than the centralized algorithms in cases
where reads/writes are equally likely to occur. Our protocol scales more efficiently than
the centralized protocol as the number of processors increases from 50 to 1000 (Figure
4.11), particularly in maintaining low access times for low numbers of processors,
increasing up to the maximum value of 202 ticks which is constantly maintained by
the centralized protocol.
Figure 4.10 Access Efficiency vs the Probability of Reading
Figure 4.11 Average Access Time vs Number of Processors
Figure 4.12 shows that for a low number of processors, the decentralized protocol
maintains a low wait time, which increases up to a maximum of 202 ticks as the number
of processors increases. The centralized protocol maintains a significant increase
in waiting time, due to the contention among processors for the centralized shared
memory. This in turn increases the average access time, which is at a maximum of
202 clock ticks (refer to Figure 4.11).
Figure 4.12 Average Wait Time vs Number of Processors
For our simulation, the decentralized protocol distributes objects to processors
uniformly. Figures 4.13 and 4.14 show that as the number of objects scales from 50
to 400, the average access/waiting time per operation decreases for the decentralized
protocol and is constant for the centralized protocol. In the decentralized protocol,
if objects are not distributed among processors evenly, or if the number of objects
is less than the number of processors, then the possibility of processor contention
for the shared objects increases, which in turn increases the access/wait time per
operation. Therefore, it is probably best to distribute objects to processors evenly,
and to processors which access the object more frequently.
Figure 4.13 Average Access Time vs Number of Operations
Figure 4.14 Average Wait Time vs Number of Objects
Finally, we compare the performance, in terms of execution time, of our protocol to
the single shared memory protocol in [MSRN93, MSZ93]. As shown in Figure 4.15, the
fixed manager scheme performs better than the dynamic manager scheme. This could
be due to the overhead associated with forwarding messages, which could affect the
locality of reference as well as increase the average access time. Given that 50% of all accesses
are performed on owned objects with a 50% chance of reading and writing, and there
are 100 processors and 100 objects, our algorithm is significantly faster than the
centralized protocols. As shown in Figure 4.15, these assumptions directly affect the
performance, as the locality of reference is very high. If the assumptions are such
that the locality of reference is very low, our protocol's performance decreases (Figures
4.16 and 4.17).
Figure 4.15 Comparison of DSM Algorithms .5 read, .5 owner selection
The assumptions made about the simulation throughout the thesis may provide
results which are misleading. In order to justify our results, we provide a performance
comparison using an 80% probability for a read, and a 0% or 20% probability of selecting
an owned object. As shown in Figures 4.16 and 4.17, the decentralized algorithms still
perform better due to the locality of reference, as read and write operations can both
be performed locally.
Figure 4.16 Comparison of DSM Algorithms .8 read, .2 owner selection
".Flxed:v:sD.,y,n ilm Ie
:::.Dynam icvsfMS RN ,MS.R)
.Fix e:d .:v's::lM SRN.,MS·~l· ,.. ,....
Figure 4.17 Comparison of DSM Algorithms .8 read,. 0.0 owner selection
CHAPTER 5
CONCLUSION
5.1 Summary
In this thesis, we presented a decentralized cache-consistency protocol for DSM which
manages objects distributed among all processors in the system. Our protocol preserves
the real-time ordering on write operations, and allows the same set of sequentially
consistent executions as [Bro90, MSRN93, MSZ93] without requiring atomic
broadcast or multicast. We prove that our protocol enforces sequential consistency
(refer to Appendix A). We give performance metrics to show that our
protocol provides an increase in access performance due to the locality of reference,
and scales to a large number of processors more efficiently than the previous protocols,
by avoiding the bottleneck of a single processor centralized memory. Although
our protocol requires additional state information, the tradeoff of memory cost for
communication cost provides a reduction in overall communication cost.
5.2 Future Work
In this thesis, we are concerned with the performance of our protocol. Future work
in protocol performance would be to reduce memory cost using a method similar to
[MSZ93]. Other future work would be to look at fault-tolerance issues that could affect
the performance of the protocol, such as transient failures. More work needs to be
done simulating different distributions to distribute, access, and migrate objects.
BIBLIOGRAPHY
[ABM89] Y. Afek, G. Brown, and M. Merritt. A lazy cache algorithm. In Proceedings
of the ACM Symposium on Parallel Algorithms and Architectures,
pages 209-222. ACM, 1989.

[AHJ91] M. Ahamad, R. Hutto, and R. John. Implementing and programming
causal distributed shared memory. In Proceedings of the IEEE International
Conference on Distributed Computing Systems, pages 274-281.
IEEE, 1991.

[Bro90] G. Brown. Asynchronous multicaches. Distributed Computing, 4:31-36,
1990.

[BT91] H. E. Bal and A. S. Tanenbaum. Distributed programming with shared
data. Computer Languages, 16:129-146, 1991.

[FL92] M. Feeley and H. Levy. Distributed shared memory with versioned objects.
Technical Report TR-92-03-01, Department of Computer Science
and Engineering, University of Washington, 1992.

[GLL+90] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and
J. Hennessy. Memory consistency and event ordering in scalable shared
memory multiprocessors. In Proceedings of the 17th Annual Symposium
on Computer Architecture, pages 15-26. Computer Architecture News,
1990.

[Hay88] J. P. Hayes. Computer Architecture and Organization. McGraw-Hill, Inc.,
1988.

[HW90] M. Herlihy and J. Wing. Linearizability: A correctness condition for
concurrent objects. ACM Transactions on Programming Languages and
Systems, 12:463-492, 1990.

[Lam79] L. Lamport. How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Transactions on Computers, 28:690-691,
1979.

[LH89] K. Li and P. Hudak. Memory coherence in shared virtual memory systems.
ACM Transactions on Computer Systems, 7:321-359, 1989.

[MSRN93] M. Mizuno, G. Singh, M. Raynal, and M. Neilsen. Communication efficient
distributed shared memories. Technical Report TR-CS-93-3, Department
of Computer and Information Sciences, Kansas State University,
1993.

[MSZ93] M. Mizuno, G. Singh, and J. Z. Zhou. A sequentially consistent
distributed shared memory. Technical Report TR-CS-93-4, Department
of Computer and Information Sciences, Kansas State University, 1993.

[Sta84] J. A. Stankovic. A perspective on distributed computer systems. IEEE
Transactions on Computers, 33:28-41, 1984.

[SZ90] M. Stumm and S. Zhou. Algorithms implementing distributed shared
memory. IEEE Computer, 23:54-64, 1990.

[Tan92] A. Tanenbaum. Modern Operating Systems. Prentice Hall, Englewood
Cliffs, N.J., 1992.
APPENDIX A

RELATED PROOFS
A.1 Proof of Program Correctness
In this section we prove that our protocol preserves the real-time ordering on write
operations and allows the same set of sequentially consistent executions as [Bro90,
MSRN93, MSZ93]. The following proof is based on [MSRN93].

Theorem: The implementation is consistent; that is, it satisfies Definition 2.

Assumptions:

(a) No two processors initially own an object simultaneously.

(b) All operations on an object, local or remote, are executed atomically.

(c) Assume, for all processors, that the objects owned by the processors make
up the distributed shared memory; that is, $DSM = h_1 \cup h_2 \cup \cdots \cup h_N$, where
N is the number of processors. Since DSM is the global shared memory, all
writes are performed only on objects maintained in DSM. More formally,
let $P = (p_1, p_2, \ldots, p_N)$ be a set of processors, $O = (o_1, o_2, \ldots, o_M)$ be a
set of objects maintained in the system, and $H = (h_1, h_2, \ldots, h_N)$ be the sets of
objects owned by each processor such that $h_i \cap h_j = \emptyset$ for all $p_i, p_j \in P$
with $i \neq j$. Since all updates to an object $o \in h_i$ are performed only by processor $p_i$, this
maintains the strict ordering among writes for each $h_i \in DSM$.
Proof: Let $\hat{H}_o$ be an execution history of the protocol for object o. In order to
show the implementation is consistent, by Definition 2, we have to show that:

(i) We can construct a sequential history $\hat{W}_o = (W_o, \rightarrow_{W_o})$, where $W_o$ is the
set of all write operations on o in $\hat{H}_o$. (This preserves the real-time ordering on
writes for each object separately.)

(ii) For each processor j we can construct a legal sequential history $\hat{W_oR_j} =
(W_oR_j, \rightarrow_{W_oR_j})$, where $W_oR_j = W_o \cup R_j$ and $R_j$ is the set of read operations
performed by processor j on object o in $\hat{H}_o$, such that $\hat{W_oR_j}$ respects $\hat{W}_o$
and $\hat{H}_o \mid i = \hat{W_oR_j} \mid i$ iff process i owns o.

1. Now let $\hat{W}_o = (W_o, \rightarrow_{W_o})$ be a history such that if o and o' are operations in
$W_o$, then $o \rightarrow_{W_o} o'$ if o is processed before o' by the owner of the object, denoted
$OWNER_o$. Because $OWNER_o$ processes the write operations sequentially,
$\hat{W}_o = (W_o, \rightarrow_{W_o})$ is a sequential history. We will show (ii) by constructing a
legal sequential history $\hat{W_oR_j} = (W_oR_j, \rightarrow_{W_oR_j})$ as follows:

(a) For any two operations o and o' in $W_oR_j$ which access $OWNER_o$, $o \rightarrow_{W_oR_j}
o'$ if o is processed before o'.

(b) For any two operations $o_j$ and $o_j'$ performed by processor j, $o_j \rightarrow_{W_oR_j} o_j'$
if $o_j$ is processed before $o_j'$.

(c) Let $r_{j1}, r_{j2}, \ldots, r_{jN}$ be a sequence of consecutive local read operations by
process j (thus, $r_{j1} \rightarrow_{W_oR_j} r_{j2} \rightarrow_{W_oR_j} \cdots \rightarrow_{W_oR_j} r_{jN}$ due to the ordering
enforced by (b)). Let $o_z$ be an operation by any processor z which accesses
$OWNER_o$ and immediately follows $o_j$ at $OWNER_o$ (thus, $o_j \rightarrow_{W_oR_j} o_z$
due to the ordering enforced by (a)). Then, $r_{jN} \rightarrow_{W_oR_j} o_z$.

2. From (a) and the fact that all operations in $W_o$ access $OWNER_o$, we have
that $\hat{W_oR_j} = (W_oR_j, \rightarrow_{W_oR_j})$ respects $\hat{W}_o$. From (b), $\hat{H}_o \mid j = \hat{W_oR_j} \mid j$.
Finally, we will show that $\hat{W_oR_j}$ is legal.

Proof: Assume that $\hat{W_oR_j}$ is not legal for some processor j. Then there must
exist a read operation $r(x)v$ such that $w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j}
r(x)v$ and there does not exist $w(x)s$ such that $w(x)u \rightarrow_{W_{OWNER_x}R_j} w(x)s \rightarrow_{W_{OWNER_x}R_j}
r(x)v$. There are three cases to consider:
Case 1: Operation $r(x)$, performed by processor i, accesses $OWNER_x = i$; that is,
the reading processor owns x. Then the last value written by any processor j is the
most recent write at processor i. Therefore, the history $w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u
\rightarrow_{W_{OWNER_x}R_j} r(x)v$ never occurs.

Case 2: Operation $r(x)$ accesses $OWNER_x$, and processor i is not the owner.
Clearly, from the protocol, $w(x)u$ writes u to $C_{OWNER_x}[x]$, and $r(x)$ is performed
after $w(x)u$ at $OWNER_x$. Thus, $r(x)$ does not return v, and the history
$w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j} r(x)v$ never occurs.

Case 3: Operation $r(x)$ is a local read and processor i is not the owner of x.
Let $o_j$ be the last operation before $r(x)$ by processor j which accesses some
shared object. There are three cases to consider:

(a) $w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j} o_j \rightarrow_{W_{OWNER_x}R_j} r(x)v$: In this
case, assume the owner of the object is processor k. There are two cases
to consider:

(1) $w(x)u$ is issued by processor j: Then $w(x)u$ sets $C_k[x] = u$, $Causal_k[x]$
is incremented, and $C_j[x] = u$, $Causal_j[x] = Causal_k[x]$, and $valid_j[x] =
1$. Since there does not exist $w(x)s$ ordered by $\rightarrow_{W_kR_j}$ in between $w(x)u$
and $r(x)$, the values of $valid_j[x]$ and $C_j[x]$ stay unchanged at least until
$r(x)$ is performed. Thus, $r(x)$ locally reads value u from $C_j[x]$, and the history
$w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j} r(x)v$ never occurs.

(2) $w(x)u$ is not issued by processor j: Execution of $w(x)u$ sets $C_k[x] = u$
and $Causal_k[x]$ is incremented. After $o_j$ accesses $OWNER_x = k$,
$valid_j[x] = 1$ at processor j. Since $r(x)$ is a local read, $valid_j[x]$ must
be 1 when $r(x)$ is performed by processor j. This means $valid_j[x]$ has
been changed to 1 before $o_j$ is completed. From the protocol, $valid_j[x]$
can be changed to 1 only if a read or write operation on x by processor
j is performed at $OWNER_x = k$. By the assumption, there does not
exist $w(x)s$ ordered in between $w(x)u$ and $r(x)$ by $\rightarrow_{W_{OWNER_x}R_j}$. Therefore,
there must be a read operation by processor j which reads $C_k[x]$ at
$OWNER_x = k$ between $w(x)u$ and $o_j$, including $o_j$. This read operation
also sets $valid_j[x] = 1$ and $C_j[x] = u$. Thus, $r(x)$ returns u, and the history
$w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j} r(x)v$ never occurs.

(b) $o_j$ is $w(x)u$: Then, the operation $w(x)u$ sets $valid_j[x] = 1$ and $C_j[x] = u$.
Since there is no operation by processor j which accesses an object at
$OWNER_x = k$ between $w(x)u$ and $r(x)$, $r(x)$ returns u. Thus, the history
$w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j} r(x)v$ never occurs.

(c) $o_j \rightarrow_{W_{OWNER_x}R_j} w(x)u$: In this case, rule (1c) above orders $r(x)$ in between
$o_j$ and $w(x)u$. Hence, a history $w(x)v \rightarrow_{W_{OWNER_x}R_j} w(x)u \rightarrow_{W_{OWNER_x}R_j}
r(x)v$ never occurs.
A.2 Proof of Bound on Forward Messages
The two critical questions about this algorithm are whether a forwarded request eventually
arrives at the true owner and how many forwarding requests are needed in the
worst case. In order to answer these questions, consider all Probowners of an object o
as a directed graph $G_o = (V, E_o)$, where V is the set of processors numbered 1, ..., N,
$|E_o| = N$, and an edge $(i, j) \in E_o$ iff the Probowner for object o on processor
i is j. The following proof follows [LH89].

Theorem: A request for an object assumed to be owned by the hint in
Probowner will reach the true owner in at most N - 1 forwarding request
messages.

Lemma: Because read/write requests are executed atomically and migration
is done only during a remote operation, migration occurs
sequentially. Assuming migration takes place in the worst case (i.e., on every
access), every Probowner graph $G_o = (V, E_o)$ has the following properties:

1. there is exactly one node i such that $(i, i) \in E_o$;

2. the graph $G_o' = (V, E_o - \{(i, i)\})$ is acyclic; and

3. for any node x, there is exactly one path from x to i.

Proof: By induction on the number of migrations of object o. Initially, all Probowners of
the processors in V are initialized to a default processor, and all three properties
are satisfied. After one migration of object o, say from i to j, the edge (i, i)
(i.e., the mark of the current owner of object o) is deleted from $E_o$, and the edge (i, j) is
inserted into the Probowner graph $G_o$. This ensures there is only one path
from i to j, satisfying property 3. As node i was the root, the subgraphs
are still pointing to i and remain unchanged and acyclic, satisfying property
2. The outgoing edge of j is deleted from $E_o$, because j now becomes the owner (the new
root); therefore the edge (j, j) is inserted into the graph $G_o$. This
satisfies property 1. After k migrations of an object o, the Probowner graph
$G_o$ satisfies the three properties.

Proof: By the Lemma, there is only one path to the true owner and there is no
cycle in the Probowner graph. So, the worst case occurs when the Probowner
graph is a linear chain

    $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_N$

in which case the number of forwarding requests is N - 1 when processor $v_1$
requests an operation from processor $v_N$.
APPENDIX B

COST/PERFORMANCE
Table B.l.
Memory Cost Performance
Assumptions: K = # of object, N = # of processors
Protocol Process Memory Size
[ABM89] Processor IN Queue Unbounded
OUT Queue Unbounded
Cachet ] K
Shared Memory Cachet ] K
[Bro90] Processor Queue Unbounded
Cachet ] K
Shared Memory Cachet ] K
(MSRN93] Processor Valid[ ] K
Cache[ ] K
SMem M[ 1 K
Cache_Vert ][ ] NxK
Causal( ] K
[MSZ93] Processor Valid[ J K
Cachet ] K
SMem hlw[ ][ ] N x K binary
vector
, Causal[ J K
Fixed Processor C[ ] K
Manager valid[ ] K
Decentralized Causal[ ] K
Distributed Processor C[ ] K
Manager Probowner[ ] K
Decentralized valid[ J K
Causal[ ] K
Table B.2. Communication Cost Performance

Message    [Bro90, ABM89]    [MSRN93, MSZ93]    Decentralized
Reads      0 if valid        1 if ¬valid        0 if owner
           1 if ¬valid       0 if valid         1 if ¬owner
Writes     N                 1                  0 if owner
                                                1 if ¬owner
Forwards   N/A               N/A                Best case: 1
                                                Worst case: N - 1
APPENDIX C

SIMULATION PARAMETERS
Table C.1. Simulation Parameters - Brown's Protocol

Assumptions: No Special Hardware, N = # of processors

Process          Operation                      Clock Cycle Time
Process Manager  Read/Write Local               1
                 Remote Read                    10
                 Write SMem + Invalidations     10 + (10 * (N-1))
                 Enqueue Invalidations          1
                 Communication Latency          100
                 Duration between Operations    5
Shared Memory    Process Request                2
Table C.2. Simulation Parameters - Mizuno, Raynal, Singh, and Neilsen Protocol

Process          Operation                      Clock Cycle Time
Process Manager  Read/Write Local               1
                 Remote Read/Write              10
                 Communication Latency          100
                 Duration between Operations    5
SMem             Process Request                2
Table C.3. Simulation Parameters - Decentralized Protocol

Thread           Operation                      Clock Cycle Time
Process Manager  Local Read/Write               1
                 Remote Read/Write              10
                 Communication Latency          100
                 Duration between Operations    5
Object Manager   Process Request                2
                 Forward Request                2
VITA
Legand L. Burge III
Candidate for the Degree of
Master of Science
Thesis: A DECENTRALIZED ALGORITHM FOR
COMMUNICATION EFFICIENT DISTRIBUTED
SHARED MEMORY
Major Field: Computer Science
Biographical Data:
Personal Data: Born in Stillwater, Oklahoma on February 5, 1972,
the son of Dr. L. L. Burge Jr. and Gwenetta V. Burge
Education: Graduated from John Marshall High School, Oklahoma City,
Oklahoma, 1989; received Bachelor of Science in Computer
Science from Langston University, Langston, Oklahoma in 1992.
Completed the requirements for the Master of Science degree
in Computer Science at Oklahoma State University in July 1995.
Experience: Computer Analyst, National Security Agency, Ft. George
G. Meade, Maryland, 1991 to present.
