Mermera: Non-Coherent Distributed Shared Memory for Parallel Computing by Sinha, Himanshu Shekhar
Boston University
OpenBU http://open.bu.edu
Computer Science CAS: Computer Science: Technical Reports
1993-04
Sinha, Himanshu. "Mermera: Non-coherent Distributed Shared Memory for Parallel
Computing", Technical Report BUCS-1993-005, Computer Science Department,
Boston University, April 1993. [Available from: http://hdl.handle.net/2144/1456]
Boston University
MERMERA:
NON-COHERENT DISTRIBUTED SHARED MEMORY
FOR PARALLEL COMPUTING
Himanshu Shekhar Sinha
April 1993
BU-CS-93-005
Computer Science Department
111 Cummington Street
Boston, MA 02215
Phone: (617)353-8919
E-mail: hss@cs.bu.edu
Note
This is the technical report version of my Ph.D. thesis. To save a few trees it uses a
smaller font than the thesis and it is single-spaced. The abstract appears on page iv.
BOSTON UNIVERSITY
GRADUATE SCHOOL
Dissertation
MERMERA:
NON-COHERENT DISTRIBUTED SHARED MEMORY
FOR PARALLEL COMPUTING
by
HIMANSHU SHEKHAR SINHA
B. Tech., Indian Institute of Technology, Kharagpur, 1986
M.A., Boston University, 1989
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
1993
© Copyright by
HIMANSHU SHEKHAR SINHA
1993
Acknowledgements
Several people influenced this work directly or indirectly. I am deeply indebted to my
advisor, Abdelsalam Heddaya, for his continuous support and encouragement. He was
involved at each and every step of this research. I would like to thank Azer Bestavros and
Steven Homer for their comments and suggestions. I also thank Joyce Friedman and Sharon
Salveter for serving on my committee.
I would like to express my gratitude to the graduate student community of the Computer
Science department at Boston University. Bob Carter, Nick Roosevelt and Marwan Shaban
gave their comments on earlier drafts of this thesis. Chris Lynch and S. Rajagopalan were
wonderful officemates. The folks on Christopher Drive tolerated my intrusions into their
offices when I needed a break from the long hours at the terminal.
I thank Regina Blaney and Eileen Grabowski for their help in dealing with administrative
matters. Lou Henessy's help in my running the Distributed Computing Laboratory was
beyond his call of duty and I thank him for that.
Finally, I would like to thank my father, Sachchida Nand Sinha, and my mother, Manju
Rani Sinha, for their support and patience through this long endeavor.
MERMERA:
NON-COHERENT DISTRIBUTED SHARED MEMORY
FOR PARALLEL COMPUTING
(Order No. )
HIMANSHU SHEKHAR SINHA
Boston University Graduate School, 1993
Major Professor: Abdelsalam Heddaya, Assistant Professor of Computer Science
Abstract
The proliferation of inexpensive workstations and networks has prompted several researchers
to use such distributed systems for parallel computing. Attempts have been made
to offer a shared-memory programming model on such distributed memory computers. Most
systems provide a shared memory that is coherent in that all processes that use it agree on
the order of all memory events. This dissertation explores the possibility of a significant
improvement in the performance of some applications when they use non-coherent memory.
First, a new formal model to describe existing non-coherent memories is developed. I
use this model to prove that certain problems can be solved using asynchronous iterative
algorithms on shared-memory in which the coherence constraints are substantially relaxed.
In the course of the development of the model I discovered a new type of non-coherent
behavior called Local Consistency.
Second, a programming model, Mermera, is proposed. It provides programmers with
a choice of hierarchically related non-coherent behaviors along with one coherent behavior.
Thus, one can trade off the ease of programming with coherent memory for improved
performance with non-coherent memory. As an example, I present a program to solve a linear
system of equations using an asynchronous iterative algorithm. This program uses all the
behaviors offered by Mermera.
Third, I describe the implementation of Mermera on a BBN Butterfly TC2000 and on
a network of workstations. The performance of a version of the equation solving program
that uses all the behaviors of Mermera is compared with that of a version that uses
coherent behavior only. For a system of 1000 equations the former exhibits at least a 5-fold
improvement in convergence time over the latter. The version using coherent behavior only
does not benefit from employing more than one workstation to solve the problem while the
program using non-coherent behavior continues to achieve improved performance as the
number of workstations is increased from 1 to 6. This measurement corroborates our belief
that non-coherent shared memory can be a performance boon for some applications.
Contents
Acknowledgements
Abstract
1 Introduction
1.1 Related Work
1.2 Outline of this Thesis
2 A Formal Model of Shared Memory
2.1 Operations and Orderings
2.2 Coherent and Non-Coherent Memories
2.2.1 Coherent Memories
2.2.2 Non-Coherent Memories
2.2.3 Relating Memories
2.3 Asynchronous Iterations on Slow Memory
3 Mermera: A System that Combines Coherent and Non-Coherent Memories
3.1 Algorithms that Tolerate Non-coherence
3.2 Specification of Mermera
3.2.1 Informal Specification
3.2.2 Formal Specification
3.3 Using Mermera
3.3.1 Solving a System of Equations
3.3.2 Barrier Synchronization
4 A Pilot Study on a BBN Butterfly
4.1 Implementation
4.1.1 Two phase locking (2PL)
4.1.2 Pipelined locking protocol (PLP)
4.1.3 PRAM algorithm
4.2 Performance Results
4.2.1 Access Time
4.2.2 Solving a System of Equations
5 A Prototype on a Network of Workstations
5.1 Isis Implementation
5.2 Performance Results
5.2.1 Access Time and Completion Time
5.2.2 Solving a System of Equations
5.3 Impact of Isis
6 Conclusion
6.1 Summary
6.2 Future Work
Bibliography
List of Figures
1.1 The message passing model
1.2 The shared-memory model
2.1 An example of $R_{ww}(\Phi)$
2.2 A Sequentially Consistent execution that is not Externally Consistent
2.3 An execution that violates overwrite semantics of memory
2.4 $R_{w/w}(\Phi)$
2.5 $R_{r/w}(\Phi)$
2.6 A computation that is Causal but not Coherent
2.7 A PRAM computation that is not Causal
2.8 A Slow computation that is not PRAM
2.9 A Locally Consistent computation that is not Slow
2.10 An execution that is Weak but not Locally Consistent
2.11 Hierarchy of Memories
2.12 Slow Memory is sufficient for Totally Asynchronous Iterative Methods
3.1 Hierarchy of Shared Memory
3.2 Linear equation solver
3.3 Barrier Synchronization with Mermera
4.1 The two phase locking protocol
4.2 The Pipelined Locking Protocol
4.3 The PRAM algorithm
4.4 Performance of Read operations
4.5 Performance of Read and Write operations
4.6 Performance of Write operations
4.7 Performance of Solver on TC2000
5.1 Structure of Buffer
5.2 Pseudo C code for write operations
5.3 Pseudo C code for broadcast()
5.4 Update memory message handler
5.5 Pseudo C code for enabled Buffer()
5.6 Modified Solver
5.7 Convergence time vs. Number of processors
5.8 Effect of Buffer size on Performance
6.1 Non-coherent shared memory
List of Tables
2.1 Summary of notation used in this thesis
2.2 Summary of correctness conditions for different types of memories
5.1 Access Times
5.2 Completion Times
5.3 Solver's Performance
Chapter 1
Introduction
With the proliferation of inexpensive workstations, the idea of using them collectively as
a parallel computer has gained widespread acceptance. As a result, several problems that
could only be solved on large supercomputers can now be solved on a collection of less
powerful processors. Some approaches [BBD+87, BDG+91] use the message passing paradigm
for inter-process communication. Others [LH89, MF89] use the shared-memory paradigm.
In the absence of shared memory in hardware, the message passing paradigm is amenable
to efficient implementation. On the other hand, the shared-memory paradigm is a natural
extension of programming a uniprocessor machine. But Lipton and Sandberg [LS88] have
shown that in the worst case the access time of coherent¹ shared memory is proportional
to the worst case communication delay among processes. While the coherent shared memory
model gives us ease of programming, its implementations suffer from the performance
drawback mentioned above.
In this thesis we present a non-coherent shared memory programming model which tries
to capture the best of both worlds. We sacrifice some ease of programming from the coherent
shared-memory model to get performance that is closer to the message passing model.
Several non-coherent memories have been proposed [LS88, WW90, HA90]. These memories
are expected to perform better than coherent memory because of the weaker
synchronization requirements of the operations they provide. The weaker synchronization
requirements permit operations to be buffered and communication and computation to be
overlapped. However, a uniform formal model in which these non-coherent memories can
be described has been lacking. We give a formalism based on partial orders on memory
events to describe these non-coherent memories. Although we did not set out to invent a
new non-coherent behavior, in the course of the development of our formalism we discovered
a behavior which we call local consistency. This condition should be satisfied by all shared
memories. We use our formalism to prove that totally asynchronous iterative methods to
find fixed points can converge on Slow memory [HA90], a non-coherent memory.
We argue that programmers should be given the choice of several non-coherent behaviors,
thereby enabling them to trade off programming simplicity for better performance. Taking
this into consideration we give the specification of our model, Mermera², in which we
give the programmer a choice of several hierarchically related non-coherent behaviors and
one coherent behavior. We show how these different behaviors can be used in a program
by implementing a totally asynchronous iterative method for solving a system of linear
equations.
¹ We will formally define coherence in Chapter 2. Intuitively, in coherent shared memory all processes
agree on the order of events on memory. Sequential Consistency [Lam79] is an example of coherent behavior
and in Chapter 2 we show that our notion of coherence is equivalent to Sequential Consistency. Therefore,
we use the terms Coherence and Sequential Consistency synonymously.
A description of the implementation and the performance of Mermera on two different
platforms³ is also included in this thesis. To the best of our knowledge this is the first
implementation of a system that combines coherence and non-coherence. The asynchronous
iterative algorithm mentioned above was programmed on our implementations and its
performance is reported. The first implementation is a pilot study on a 45 node BBN Butterfly
TC2000 which is a distributed memory machine where the time to access a remote location
is 3 times the time to access a local location (≈ 0.7 μseconds for a local access). The
access time⁴ for non-coherent memory using full replication is at least an order of magnitude
smaller than the access time for the coherent memory implementation.
The second implementation is on a network of six workstations connected by an
Ethernet. Once again, full replication is used. The access time for non-coherent memory is
independent of the number of processes sharing memory, while it grows linearly in the
number of processors for coherent memory. We observe that non-coherent write operations
show a 40-fold improvement in completion time (defined in Chapter 5) over coherent writes.
The linear solver using Mermera performs at least 5 times faster on non-coherent memory
than on coherent memory. Moreover, the performance of the solver using coherent behavior
breaks down at six processors, i.e., it takes more time to solve the equations using six
processors than it does using five processors. On the other hand, the solver using non-coherent
memory continues to show improvement in performance with up to six processors⁵. The
coherent solver never performs better than a sequential solver using the Gauss-Seidel iterative
method while the non-coherent solver does.
1.1 Related Work
In the message passing programming model (Figure 1.1) each process can access data in
its address space using read/write operations. Processes communicate with each other by
sending and receiving messages using send/receive operations.
In the shared-memory programming model (Figure 1.2) all processes share the same
memory. Inter-process communication is achieved by read/write operations to memory.
In Figures 1.1 and 1.2 the P's and the M's denote processes and their address spaces
respectively.
We use the term distributed memory machine for hardware that consists of several pro-
cessors each with its own memory. The processor-memory pairs are connected by some
network. Our goal is to give a fast shared-memory interface to a distributed memory ma-
chine.
² Mer is the Latin root for memory derived from the Sanskrit root Smar. Mermeros is an ancient Greek
name meaning care laden.
³ An implementation on a Connection Machine is in progress.
⁴ The terms access time and completion time will be defined in Chapters 4 and 5.
⁵ Due to limited equipment we could not measure performance on more than six processors.
[Figure 1.1: The message passing model. Processes P_i and P_j, each with its own memory M, communicate over a network using Send(P_j) and Receive(P_i).]
[Figure 1.2: The shared-memory model. Processes P_i, ..., P_j all access a single shared memory.]
The first implementation of the shared-memory model on a distributed system is
described in [Li86] and elaborated in [LH89]. It guarantees dynamic atomicity (cf. Chapter 2)
which is a stronger condition than sequential consistency [Lam79]. It also uses a page of
memory as a unit of data transfer, which can lead to false sharing that can further degrade
its performance. The distributed shared memory is implemented in libraries linked to the
user's program. Mirage [FP89] is another system that provides coherent shared memory. It
puts the implementation of distributed shared memory in the kernel.
The first implementation of Mether [MF89] is also a page-based system that guarantees
dynamic atomicity. Orca [BKT92] provides coherent shared objects. Their implementation
uses full replication, as does ours.
The Pipelined Random Access Memory (PRAM) was proposed by Lipton and
Sandberg [LS88]. It guarantees that writes made by the same process are read by all processes
in the order they were written by the writer. Writes by different processes can be
interleaved in different orders at different processes. A hardware implementation of PRAM is
described in [San90, Ser90].
Hutto and Ahamad proposed Causal memory, Slow memory and Weak memory in [HA90].
Causal memory respects the potential causality [Lam78] of memory operations. Slow
memory only guarantees that writes by the same process to the same location are read by all
processes in the order they were written by the writer. Writes to different locations, even
by the same process, can be interleaved in different orders at different processes. A design
for a Causal memory system is described in [AHJ91] and a formal specification is given
in [ABHN91]. We do not know of an implementation of any of these memories.
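The informal guarantees of PRAM and Slow memory described above can be contrasted with a small sketch. It is only an illustration of the prose descriptions, not the formal definitions given in Chapter 2, and it assumes a hypothetical model in which each observing process applies a sequence of writes, each labeled with its writer, its position in the writer's program order, and its location. The names `Write`, `respects_pram` and `respects_slow` are invented for this example.

```python
from typing import Dict, Hashable, NamedTuple, Sequence


class Write(NamedTuple):
    writer: str  # process that issued the write
    seq: int     # position of the write in the writer's program order
    loc: str     # memory location written


def _in_issue_order(observed: Sequence[Write], key) -> bool:
    """True if, within each group selected by key, seq strictly increases."""
    last: Dict[Hashable, int] = {}
    for w in observed:
        k = key(w)
        if k in last and w.seq <= last[k]:
            return False
        last[k] = w.seq
    return True


def respects_pram(observed: Sequence[Write]) -> bool:
    """PRAM-style check: writes of each writer are seen in issue order."""
    return _in_issue_order(observed, key=lambda w: w.writer)


def respects_slow(observed: Sequence[Write]) -> bool:
    """Slow-style check: order is only required per (writer, location)."""
    return _in_issue_order(observed, key=lambda w: (w.writer, w.loc))
```

Under these assumptions, an observer that sees a process's later write to y before its earlier write to x violates the PRAM-style check but passes the Slow-style one, which is the extra freedom Slow memory grants.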
In Chapter 3 we argue for the combination of different behaviors in one system.
Multi-Version memory [WW90] combines the behaviors of Weak memory and Dynamic Atomic
memory. An implementation of it was used to speed up a parallel B-Tree algorithm. Later
implementations of Mether [MF90, Min91] allow processes to read inconsistent data and
the programmer can choose to enforce consistency at any point in the program. The
behavior permitted is very similar to that permitted by Multi-Version memory. Attiya and
Friedman [AF92] present a correctness condition for a system that supports one coherent
behavior and one non-coherent behavior. Their formalism is different from ours and our
specification allows a hierarchy of non-coherent behavior.
The architecture community has also moved in the direction of relaxing consistency
constraints to address the high latency of sequentially consistent memory. Processor
consistency [Goo91] is similar to PRAM above with the added condition that all writes to
the same location be totally ordered. The PLUS system [BR90] implements processor
consistency. Dubois et al. [DSB86] proposed the idea of relating the ordering of events in the
memory to synchronization points in the program. This requires the program to distinguish
between ordinary and synchronizing accesses to the memory. Extensions of this idea can be
found in [AH90, GLL+90]. Release Consistency, defined in [GLL+90], has been implemented
in the DASH multiprocessor [LLG+92]. Munin [CBZ91] implements release consistency in
software on a network of workstations. Lazy release consistency [KCZ92] is an algorithm
for implementing release consistency. Midway [BZ91] uses entry consistency in which all
shared data is associated with synchronization variables. Here also, the goal is to reduce
the amount of communication by propagating data associated with certain synchronization
variables.
All these approaches differ from our approach in that they require the programmer to
differentiate between synchronization accesses and other accesses. We achieve
synchronization through ordinary operations on the memory. Our approach is especially suited for
asynchronous algorithms [BT89] that have very weak synchronization requirements.
1.2 Outline of this Thesis
In Chapter 2 we present our formalism that describes the different non-coherent behaviors
proposed in the literature. It is based on partial orders on memory events. We formally
confirm the known hierarchy among the different types of behaviors. Using this formalism
we prove that Slow memory, one of the weakest behaviors in this hierarchy, is sufficient for
the convergence of totally asynchronous iterative methods [BT89].
In Chapter 3 we argue for the inclusion of different types of behavior in the same system.
We extend our formalism to specify the behavior of Mermera, a system that combines
coherence and non-coherence. Our programming model is presented in this chapter. We
give a program to solve a linear system of equations using an asynchronous iterative method.
This program uses the different behaviors offered by Mermera.
Chapter 4 describes the implementation and performance of Mermera on a 45 node
BBN TC2000. This was a pilot implementation that allowed us to estimate the performance
improvement that could be obtained from non-coherent behavior.
An implementation of Mermera on a network of Sparcstation 1+s is described in
Chapter 5. We use the Isis toolkit [Bir91] for this implementation. We report on the
performance of this implementation under a synthetic memory reference pattern and also
under the linear equation solver described in Chapter 3. Finally, in Chapter 6 we discuss
our conclusions and state some directions for future work.
Chapter 2
A Formal Model of Shared Memory
In this chapter we present a formal model of shared memory based on the order of events
allowed on the shared memory. We describe several existing memory behaviors using this
model. Our formalism enables us to define these behaviors precisely in one model, thereby
making comparisons among these behaviors straightforward. Earlier descriptions of these
behaviors were in terms of algorithms that implemented these memories, making a formal
comparison hard. In the course of our analyses, we also discovered a new type of
non-coherent behavior which we call Local Consistency (Section 2.2.2).
Section 2.1 introduces the notation we use to describe our model. Section 2.2.1 offers a
formal definition of Coherence and shows how it compares with Sequential Consistency, a
very commonly used notion of coherence [Lam79]. We prove that both notions are
equivalent. Section 2.2.2 formally defines a variety of known non-coherent memories. The
relationship between these memories is explored in Section 2.2.3 and a hierarchy is established
among them. Finally, in Section 2.3 we use our formalism to prove that Slow memory, one
of the weaker memories in the hierarchy, is sufficient for the convergence of the class of
totally asynchronous iterative algorithms.
2.1 Operations and Orderings
We base our model on memory events (i.e., executions of memory operations) rather than
on memory states, since the latter are only observable through the former. We describe the
correctness conditions of the various kinds of non-coherent memories in terms of the event
orderings that they allow.
In this section, we define the essential components that constitute a computation. These
components are general enough to enable the formal description of non-coherent memory
behavior. Our model consists of a set of processes P sharing a set of memory locations L,
by executing operations on them. In general, we denote an operation execution by $x.o(a).r$,
where x is the name of the memory location, o stands for the operation (e.g., read or write),
and a, r represent the lists of arguments and results, respectively. An operation execution
has an invocation part $x.o(a)$ and a return part $x.o.r$.
Notation              Meaning
$x.r_i.v$             A read event of location x, returning v and uniquely identified by i. We use i to refer to the event's process.
$x.w_i(v)$            A write event that writes the value v to location x.
$\alpha, \beta, \ldots$   Event variables.
$\Sigma_i$            The set of all events in process i.
$\Sigma$              The set of all events in a computation. $\Sigma = \bigcup_i \Sigma_i$
$R_{*}$               A relation, generally on $\Sigma$ and generally not transitive.
$R^{+}_{*}$           Irreflexive transitive closure of $R_{*}$.
$R_i$                 Process program ordering imposed by i's program on its events, $\Sigma_i$. Must be a partial order.
$R_{wr}$              Writes-to ordering between write events and the read events that read their values (def. 2.1).
$R$                   Global program ordering, the union of all process program orderings and the writes-to ordering. $R = \bigcup_i R_i \cup R_{wr}$.
$R_{\Phi}$            Subset of R that relates only events in $\Phi \subseteq \Sigma$.
$\Phi^{w}$            Write-closure of $\Phi$, contains $\Phi$ and all write events which are read by events in $\Phi$ (def. 2.5).
$R^{w}_{\Phi}$        $R_{\Phi} \cup \{(\alpha, \beta) \mid (\beta \in \Phi) \wedge (\alpha R_{wr} \beta)\}$, the write-closure of $R_{\Phi}$.
Table 2.1: Summary of notation used in this thesis.
An event is a particular operation execution on behalf of a particular process. Events
are related to each other by various orderings, such as the ordering induced by each process'
program, or the ordering induced by information flow between processes. These orderings
are defined as needed throughout this chapter. We denote an event by $x.o_i(a).r$, where i is
a unique identifier for the event. With a slight abuse of notation, we will use i to represent
the process on behalf of which the event is executed.¹
In this thesis, we restrict our attention to shared memory that supports only read
and write operations.² Thus, operation executions can take only the forms $x.r().v$ or
$x.w(v).OK$. For brevity, we will henceforth write $x.r.v$ and $x.w(v)$.
A process i is a partially ordered set (POSET) of events $\langle \Sigma_i, R_i \rangle$, where $R_i$ is an
irreflexive partial ordering induced by process i's program. The set of all events $\Sigma$ is given by
$\bigcup_i \Sigma_i$.
A glossary of the notation used in this thesis is shown in Table 2.1.
To simplify our presentation, we assume, without loss of generality, that
Assumption 2.1 Every value v can be written to each memory location at most once, i.e.,
$x.w_i(v_1) \neq x.w_j(v_2) \implies v_1 \neq v_2$.
This can be easily enforced by appending a logical timestamp to the values written.
¹ The process identifier is typically encoded in the event identifier. For example, a unique identifier for
events belonging to sequential processes can be constructed by appending an event sequence number to the
process identifier.
² It should be interesting to study shared memory that supports objects with arbitrary operations (e.g.,
incr/decr, enque/deque), for we can consider the variety of message passing models to be special cases of
shared memory in which the locations are queues with various ordering and composition properties.
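One way Assumption 2.1 might be enforced is sketched below. This is an assumption-laden illustration rather than the mechanism used in Mermera: the class `UniqueValueWriter` and its `tag` method are hypothetical names, and the tag combines a per-process counter with the process identifier so that two processes writing the same raw value still produce distinct tagged values.

```python
import itertools
from typing import Any, Tuple


class UniqueValueWriter:
    """Tags the values written by one process so that no tagged value repeats.

    Hypothetical helper: the logical timestamp is a per-process counter, and
    the process identifier is appended to keep tags distinct across processes.
    """

    def __init__(self, process_id: str) -> None:
        self.process_id = process_id
        self._clock = itertools.count()  # logical clock local to this process

    def tag(self, value: Any) -> Tuple[Any, int, str]:
        """Return (value, logical timestamp, process id) for the next write."""
        return (value, next(self._clock), self.process_id)
```

A reader that needs the raw value simply takes the first component of the tagged tuple.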
The writes-to ordering³ is the last component we need to define our notion of
computation. This ordering relates a write event to every read event that returns its value.
Formally,
Definition 2.1 $R_{wr} \subseteq \Sigma \times \Sigma$ is the minimal relation such that, $\forall v$, for every pair of write
and read events in $\Sigma$, $x.w_i(v) \, R_{wr} \, x.r_j.v$.
We can now define a computation as follows.
Definition 2.2 A computation E is a pair $\langle \Sigma, R \rangle$.
Definition 2.3 We say that a computation is well-formed if for every read event $\beta \in \Sigma$
there exists a write event $\alpha \in \Sigma$ such that the value read by $\beta$ is written by $\alpha$, i.e., $\alpha \, R_{wr} \, \beta$.
In this thesis we deal with well-formed computations only.
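Definitions 2.1 and 2.3 can be made concrete with a short sketch. The representation is an assumption made for illustration: events are tuples carrying an identifier, a process, a kind ("r" or "w"), a location and a value, and Assumption 2.1 is taken for granted so that a (location, value) pair identifies at most one write. The names `Event`, `writes_to` and `is_well_formed` are hypothetical.

```python
from typing import Hashable, Iterable, NamedTuple, Set, Tuple


class Event(NamedTuple):
    eid: str         # unique event identifier
    proc: str        # process on whose behalf the event is executed
    kind: str        # "r" for a read, "w" for a write
    loc: str         # memory location
    value: Hashable  # value read or written


def writes_to(events: Iterable[Event]) -> Set[Tuple[Event, Event]]:
    """Build the writes-to relation: (write, read) pairs that agree on
    location and value. Relies on each (loc, value) being written once."""
    events = list(events)
    writer_of = {(e.loc, e.value): e for e in events if e.kind == "w"}
    return {
        (writer_of[(e.loc, e.value)], e)
        for e in events
        if e.kind == "r" and (e.loc, e.value) in writer_of
    }


def is_well_formed(events: Iterable[Event]) -> bool:
    """True if every read event returns a value written by some write event."""
    events = list(events)
    reads_with_writer = {read for (_, read) in writes_to(events)}
    return all(e in reads_with_writer for e in events if e.kind == "r")
```

A read whose value was never written has no partner in the writes-to relation, so a computation containing it is rejected as not well-formed.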
In the following sections we use the notions of events, program ordering and writes-to
ordering defined above, to construct other ordering relations. By constraining these ordering
relations in various ways, we arrive at precise formal definitions of a variety of non-coherent
shared memory behaviors that have been proposed. We contrast our definitions with those
offered in the literature.
2.2 Coherent and Non-Coherent Memories
2.2.1 Coherent Memories
In this section we describe executions that we consider to be Coherent. Intuitively, an
execution is Coherent if all processes agree on an ordering of all the events in the execution.
After defining coherence formally we will discuss other notions of coherence and relate our
notion of coherence to them.
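We first define a projection operator $R(\Phi)$ that projects pairs from R.
Definition 2.4 $\forall \Phi \subseteq \Sigma$, $R(\Phi) = \{(\alpha, \beta) \mid \alpha, \beta \in \Phi \wedge \alpha R \beta\}$.
For notational convenience we sometimes use $R_{\Phi}$ to denote $R(\Phi)$.
We say that a relation constructor R is monotonic iff $\forall \Phi \subseteq \Psi$, $R(\Phi) \subseteq R(\Psi)$.
Observation 1 The projection operator $R(\cdot)$ is monotonic, i.e., if $\Phi \subseteq \Psi \subseteq \Sigma$ then
$R_{\Phi} \subseteq R_{\Psi}$, which in turn implies $R^{+}_{\Phi} \subseteq R^{+}_{\Psi}$.
Definition 2.5 $\Phi^{w}$ is the write-closure of $\Phi \subseteq \Sigma$ and it is defined as
$\Phi^{w} = \Phi \cup \{\alpha \mid (\alpha R_{wr} \beta) \wedge (\beta \in \Phi)\}$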
³ Inspired by the reads-from ordering used in defining view serializability of database transactions [Pap86].
[Figure 2.1: An example of $R_{ww}(\Phi)$. The events $\alpha$, $\beta$, $\gamma$ are two writes and a read of location x, connected by $R_{wr}$, $R^{+}$ and $R_{ww}$ edges.]
w
 	
w
.
Denition 2.6 R
w

= R

[ f(; )j(R
wr
) ^ ( 2 )g.
R
w

is dierent from R

w
in that the former does not include any pair consisting of two
writes that do not belong to . For example, consider two writes by process i, ;  2 
w
such that ;  62  and R
i
. In this case (; ) 2 R

w
but not in R
w

.
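The projection, the write-closure, and the difference between the two write-closed relations can be checked on a tiny example. The sketch below assumes relations are represented as sets of ordered pairs of event names; the function names `project`, `write_closure` and `write_closed_relation` are hypothetical and are introduced only for this illustration.

```python
from typing import Hashable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def project(rel: Set[Pair], phi: Set[Hashable]) -> Set[Pair]:
    """Projection of rel onto phi: keep pairs with both events in phi."""
    return {(a, b) for (a, b) in rel if a in phi and b in phi}


def write_closure(phi: Set[Hashable], r_wr: Set[Pair]) -> Set[Hashable]:
    """phi together with every write that is read by an event in phi."""
    return set(phi) | {a for (a, b) in r_wr if b in phi}


def write_closed_relation(
    rel: Set[Pair], phi: Set[Hashable], r_wr: Set[Pair]
) -> Set[Pair]:
    """Projection of rel onto phi, plus the writes-to pairs that end in phi."""
    return project(rel, phi) | {(a, b) for (a, b) in r_wr if b in phi}


# Process i issues write "a" before write "b"; another process reads both.
r_wr = {("a", "ra"), ("b", "rb")}
rel = {("a", "b")} | r_wr
phi = {"ra", "rb"}

closure = write_closure(phi, r_wr)                  # adds the writes a and b
over_closure = project(rel, closure)                # contains ("a", "b")
closed = write_closed_relation(rel, phi, r_wr)      # does not
```

Here the program-order pair between the two writes survives projection onto the write-closure but is absent from the write-closed relation, which is the distinction drawn in the paragraph above.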
We define a relation constructor $R_{ww}(\Phi)$, that takes a $\Phi \subseteq \Sigma$ and returns a set of
ordered pairs of distinct write events from $\Phi^{w}$ that write to the same location. The relation
returned captures our notion of "observing" write events. An event $\alpha$ is observed by
another event $\beta$ iff $\alpha R^{+} \beta$. $R_{ww}(\Phi)$ returns ordered pairs of write events according to the
order they are observed.
Definition 2.7 $R_{ww}(\Phi)$ is defined on $\Phi^{w}$ as $(\alpha = x.w_i(v_1)) \; R_{ww}(\Phi) \; (\beta = x.w_j(v_2))$ iff
$\exists ((\gamma = x.r_k.v_2) \in \Phi) \mid (\alpha R^{+}_{\Phi^{w}} \gamma) \wedge (\beta R_{wr} \gamma)$.
This relation orders two writes to the same location in the order they are observed by
process k. Note that not all pairs of write events to a location will be ordered by this
relation. $R_{ww}$ is illustrated in Figure 2.1.
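Lemma 2.1 $R_{ww}(\cdot)$ is monotonic, i.e., if $\Phi \subseteq \Psi \subseteq \Sigma$ then $R_{ww}(\Phi) \subseteq R_{ww}(\Psi)$.
Proof: Let $(\alpha, \beta) \in R_{ww}(\Phi)$.
$$
\begin{aligned}
& (\alpha, \beta) \in R_{ww}(\Phi) \\
\Rightarrow\; & \exists \gamma \in \Phi \mid (\alpha R^{+}_{\Phi^{w}} \gamma) \wedge (\beta R_{wr} \gamma) \wedge (\alpha, \beta \in \Phi^{w}) && \text{(Def. 2.7)} \\
\Rightarrow\; & \exists \gamma \in \Psi \mid (\alpha R^{+}_{\Phi^{w}} \gamma) \wedge (\beta R_{wr} \gamma) \wedge (\alpha, \beta \in \Phi^{w}) && \text{(Because } \Phi \subseteq \Psi\text{)} \\
\Rightarrow\; & \exists \gamma \in \Psi \mid (\alpha R^{+}_{\Psi^{w}} \gamma) \wedge (\beta R_{wr} \gamma) \wedge (\alpha, \beta \in \Psi^{w}) && \text{(Observations 1, 2)} \\
\Rightarrow\; & (\alpha, \beta) \in R_{ww}(\Psi) && \text{(Def. 2.7)}
\end{aligned}
$$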
Definition 2.8 A computation $E = \langle \Sigma, R \rangle$ is Coherent iff $\langle \Sigma, (R \cup R_{ww}(\Sigma))^{+} \rangle$ is a partial
order. $R_{Coherent} = \langle \Sigma, (R \cup R_{ww}(\Sigma))^{+} \rangle$. The set of all Coherent computations is denoted
by CO.
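For small computations, the coherence condition of Definition 2.8 can be tested directly by building the observed write ordering of Definition 2.7 over all events and checking that the transitive closure has no cycle. The following is a minimal sketch under assumed representations: events are named by strings and mapped to a (kind, location) pair, relations are sets of ordered pairs, and the global ordering `r` is expected to already contain the writes-to pairs. All function names are hypothetical, and the naive closure is only suitable for toy inputs.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def transitive_closure(rel: Set[Pair]) -> Set[Pair]:
    """Naive fixed-point transitive closure of a finite relation."""
    closure = set(rel)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def observed_write_order(
    events: Dict[str, Tuple[str, str]], r: Set[Pair], r_wr: Set[Pair]
) -> Set[Pair]:
    """Pairs (alpha, beta) of distinct writes to one location such that some
    read gamma returns beta's value and alpha precedes gamma in the closure."""
    r_plus = transitive_closure(r)
    pairs = set()
    for beta, gamma in r_wr:
        for alpha, (kind, loc) in events.items():
            same_loc = loc == events[beta][1]
            if kind == "w" and alpha != beta and same_loc and (alpha, gamma) in r_plus:
                pairs.add((alpha, beta))
    return pairs


def is_coherent(
    events: Dict[str, Tuple[str, str]], r: Set[Pair], r_wr: Set[Pair]
) -> bool:
    """True if the closure of r plus the observed write order is irreflexive."""
    order = transitive_closure(set(r) | observed_write_order(events, r, r_wr))
    return all(a != b for (a, b) in order)
```

As a usage example, if two processes each read location x twice but see the two writes in opposite orders, the observed write order contains both directions and the computation is rejected; if both processes see the writes in the same order, it is accepted.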
There are other notions of coherence prevalent in the literature. A hierarchy of Coherent
memories is presented in [HA90]. The most commonly defined notion is that of Sequential
Consistency. An execution is Sequentially Consistent if its results were the same as if the
operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program
[Lam79]. We define the result of an execution to be the values read by the read events. We
refer to the set of Sequentially Consistent computations by SC. However, shared memory
implementations like [LH89] and coherent caches in shared memory multiprocessors
implement stronger conditions like static atomicity (SA) and dynamic atomicity (DA) [HA90]. In
static atomic executions operations take effect for all observers at some specific component
event. In dynamic atomic executions operations take effect at any point (in absolute time)
during the operation interval as long as the resulting history is equivalent to some serial
execution. External consistency (EC) of an execution E requires the existence of a
sequential order as in SC but this order is restricted to preserve the order of non-overlapping (in
absolute time) operations in E. It differs from DA in that an operation can take effect
outside the interval (in absolute time) defined by its invocation and its return. The
difference between SC and EC is illustrated in Figure 2.2. External Consistency can be used to
guarantee progress of information flow between processes but this is not true for Sequential
Consistency because a process can continue to read only those values that were written
by itself and make progress in executing its program without ever reading values written
by other processes. It is possible to construct an equivalent serial execution given such an
execution.
The description of Sequential Consistency above allows a process to read values that have not yet (again in absolute time) been written. Hutto and Ahamad define realizable Sequential Consistency, which excludes such executions.
We observe a relationship between serializability theory [BHG87, Pap86] of transactions and some of these notions of coherence. Among the various notions of serializability are final state serializability (FSR), view serializability (VSR) and conflict serializability (CSR). The two phase locking (2PL) protocol allows a subset of the executions allowed by CSR. If we treat each read or write operation on memory as a transaction then 2PL allows only static atomic executions, with the release of a lock being the component event at which an operation takes effect. The notion of External Consistency is the same as that of strict serializability in [Pap86]. VSR in conjunction with Local Consistency (LC) (Def 2.15 below) is the same as Sequential Consistency.
We will now define Sequential Consistency in our notation.
Definition 2.9 A computation $E = \langle \Delta, R \rangle$ is Sequentially Consistent iff
1. $\exists$ a total order $R_s$ over $\Delta$, such that $\langle \Delta, \bigcup_i R_i \cup R_s \rangle$ is a partial order, and
2. $\forall\, x.w(v_1)\, R_{wr}\, x.r{:}v_1$: $x.w(v_1)\, R_s\, x.r{:}v_1 \wedge \nexists\, x.w(v_2)$ such that $x.w(v_1)\, R_s\, x.w(v_2)\, R_s\, (x.r{:}v_1)$.
The set of all Sequentially Consistent computations is denoted by SC.
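For very small computations, Definition 2.9 can be checked by brute force: enumerate every total order of the events and test both conditions. The sketch below is illustrative only. It assumes a hypothetical event encoding `(process, op, location, value)`, a computation given as per-process event lists plus the set $R_{wr}$ of (write, read) pairs, and that every read appears in $R_{wr}$.

```python
from itertools import permutations

def is_sequentially_consistent(programs, r_wr):
    """Search for a total order R_s satisfying both conditions of Definition 2.9."""
    events = [e for prog in programs for e in prog]
    for order in permutations(events):
        pos = {e: k for k, e in enumerate(order)}
        # Condition 1: the total order agrees with every program order R_i.
        respects_programs = all(pos[a] < pos[b]
                                for prog in programs
                                for a, b in zip(prog, prog[1:]))
        # Condition 2: each read follows its write, with no other write to the
        # same location placed between them.
        reads_latest = all(
            pos[w] < pos[r] and not any(
                e[1] == "w" and e[2] == r[2] and pos[w] < pos[e] < pos[r]
                for e in events)
            for w, r in r_wr)
        if respects_programs and reads_latest:
            return True
    return False

# Hypothetical examples: the second has two readers that disagree on the write order.
w1, w2 = ("p1", "w", "x", 1), ("p2", "w", "x", 2)
k1, k2 = ("p3", "r", "x", 1), ("p3", "r", "x", 2)
l2, l1 = ("p4", "r", "x", 2), ("p4", "r", "x", 1)
agreeing = ([[w1], [w2], [k1, k2]], {(w1, k1), (w2, k2)})
disagreeing = ([[w1], [w2], [k1, k2], [l2, l1]],
               {(w1, k1), (w2, k2), (w2, l2), (w1, l1)})
```

The search is factorial in the number of events, so it is only useful as an executable reading of the definition on examples with a handful of events.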
The first condition states that the sequence defined by $R_s$ agrees with the ordering of events specified by each process' program. Since $R_s$ and $R$ are defined over the same set of events, the values read by corresponding read events in the computations $E = \langle \Delta, R \rangle$ and $E' = \langle \Delta, R_s \rangle$ are the same. The second condition states that a read operation returns the value written by the most recent (according to $R_s$) write in $E'$.
[Figure 2.2 (diagram): the operations x.w(1), x.r:1, x.w(2) and x.r:2 of Process 1 and Process 2, each drawn as an interval from invocation to return along the absolute time axis.]
Figure 2.2: An execution that is Sequentially Consistent but not Externally Consistent. It is in SC because both process 1 and process 2 can agree on the following order: x.w(1), x.r:1, x.w(2), x.r:2. This order is consistent with the order specified by the programs of both processes, and each read operation returns the value written by the last write in the order shown. However, this execution is not Externally Consistent because a total order that respects the order imposed by the duration of the events in absolute time does not exist.
We relate the notion of Sequential Consistency to our definition of Coherence by proving that they are equivalent notions, i.e., they allow the same computations.
Lemma 2.2 Any computation $E = \langle \Delta, R \rangle$ that is Sequentially Consistent ($\in SC$) is also Coherent ($\in CO$), i.e., $SC \subseteq CO$.
Proof: We prove the lemma by proving its contrapositive, i.e., that any computation that is not Coherent is not Sequentially Consistent, which in turn is proved by contradiction.
Let us assume that, given a non-coherent computation $E = \langle \Delta, R \rangle$, there exists a total order $R_s$ on $\Delta$ that satisfies the conditions of Sequential Consistency.
For any computation (Coherent or not) the following observations can be made. From conditions 1 and 2 of the definition of SC,
$(\alpha, \beta) \in R^{+} \Rightarrow (\alpha, \beta) \in R_s$ (2.1)
From condition 2 of the definition of SC, and the definition of $R_{ww}(\cdot)$ (def. 2.7),
$(\alpha, \beta) \in R_{ww}(\Delta) \Rightarrow (\alpha, \beta) \in R_s$ (2.2)
This is because $R_{ww}(\Delta)$ requires the existence of a read $\gamma$ that observes $\alpha$ but reads $\beta$. This implies that $\alpha R^{+} \gamma$, which from predicate 2.1 implies $\alpha R_s \gamma$. Similarly, $\beta R_{wr} \gamma \Rightarrow \beta R_s \gamma$. This establishes that both $\alpha$ and $\beta$ occur before $\gamma$ in the total order defined by $R_s$. From the second condition of the definition of Sequential Consistency, $\beta R_s \alpha$ is disallowed because it would mean that $\gamma$ did not read the last (in $R_s$) write to $x$, where $x$ is the location that $\alpha, \beta, \gamma$ operate on. Since $R_s$ is a total order and $(\beta, \alpha) \notin R_s$, $(\alpha, \beta) \in R_s$. Therefore,
$(\alpha, \beta) \in (R \cup R_{ww}(\Delta))^{+} \Rightarrow (\alpha, \beta) \in R_s$ (2.3)
From the definition of coherence we can conclude that if a computation is non-coherent then $\langle \Delta, (R \cup R_{ww}(\Delta))^{+} \rangle$ is not a partial order, which implies that there exists a pair of events $\alpha$ and $\beta$ such that both $(\alpha, \beta)$ and $(\beta, \alpha) \in (R \cup R_{ww}(\Delta))^{+}$.
From predicate 2.3 above, this implies that $(\alpha, \beta) \in R_s \wedge (\beta, \alpha) \in R_s$, which violates our assumption that $R_s$ is a total order.
Therefore, if a computation is not Coherent then a total order satisfying the conditions of SC does not exist, and a non-coherent computation is not Sequentially Consistent.
To prove the converse, $CO \subseteq SC$, we first state the following lemma, which can be proven easily by contradiction.
Lemma 2.3 Let $(S, R)$ be a partial order, i.e., $R$ is a partial order on some set $S$. If $p, q \in S$ and $p$ and $q$ are not related by $R$ then $(S, (R \cup (p, q)))$ is a partial order.
Lemma 2.4 Any computation that is Coherent is Sequentially Consistent, i.e., $CO \subseteq SC$.^4
^4 We are grateful to Trung Dung [Dun92] for this proof.
Proof: We prove this by construction. Given a Coherent computation $E = \langle \Delta, R \rangle$, we will construct a total order $R_s$ on $\Delta$ that satisfies the two conditions stated in definition 2.9.
The steps of the construction are:
1. Let $R^{0}_{s} = (R \cup R_{ww}(\Delta))^{+}$, and $i = 0$.
2. While $\exists (a = x.w(v_1)), (b = x.r{:}v_1), (c = x.w(v_2)) \in \Delta$ s.t. $a R_{wr} b$ and $b$ and $c$ are not related in $R^{i}_{s}$, do
(a) $R^{i+1}_{s} \leftarrow (R^{i}_{s} \cup (b, c))^{+}$.
(b) $i \leftarrow i + 1$.
3. While $\exists a, b \in \Delta$, s.t. $a$ and $b$ are not related in $R^{i}_{s}$, do
(a) $R^{i+1}_{s} \leftarrow (R^{i}_{s} \cup (a, b))^{+}$.
(b) $i \leftarrow i + 1$.
This construction will terminate because $\Delta$ is finite. Let $R_s = R^{i}_{s}$, i.e., the relation at the end of the construction. $R_s$ is a partial order because both steps 2 and 3 satisfy the conditions of lemma 2.3. Furthermore, $R_s$ is a total order because step 3 will terminate only after all pairs in $\Delta$ have been related.
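The three steps of the construction can be sketched in code. This is an illustrative reading under stated assumptions rather than the thesis's own artifact: events are hypothetical tuples `(process, op, location, value)`, $R$ is taken to be the union of the program orders and $R_{wr}$, the input is assumed to be Coherent, and each `while` loop is realized as a single pass (sufficient because the relation only grows, so a pair that has become related stays related).

```python
from itertools import product

def closure(pairs):
    """Transitive closure of a set of ordered pairs."""
    rel = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(rel, repeat=2) if b == c} - rel
        if not new:
            return rel
        rel |= new

def related(rel, a, b):
    return (a, b) in rel or (b, a) in rel

def build_total_order(programs, r_wr):
    """Return the events of a Coherent computation listed in the order R_s."""
    events = [e for prog in programs for e in prog]
    r = set(r_wr)
    for prog in programs:
        r.update(zip(prog, prog[1:]))            # program order R_i
    r_plus = closure(r)
    ww = {(a, w) for a in events for (w, g) in r_wr   # R_ww(Delta), Definition 2.7
          if a[1] == "w" and a != w and a[2] == w[2] and (a, g) in r_plus}
    rs = closure(r | ww)                         # step 1
    for a, b in sorted(r_wr):                    # step 2: reads before unrelated writes
        for c in events:
            if c[1] == "w" and c[2] == b[2] and c != a and not related(rs, b, c):
                rs = closure(rs | {(b, c)})
    for a, b in product(events, repeat=2):       # step 3: relate all remaining pairs
        if a != b and not related(rs, a, b):
            rs = closure(rs | {(a, b)})
    # In a total order every event has a distinct number of predecessors.
    return sorted(events, key=lambda e: sum((x, e) in rs for x in events))

# Hypothetical two-process example in the spirit of Figure 2.2.
w1, r1 = ("p1", "w", "x", 1), ("p1", "r", "x", 1)
w2, r2 = ("p2", "w", "x", 2), ("p2", "r", "x", 2)
order = build_total_order([[w1, r1], [w2, r2]], {(w1, r1), (w2, r2)})
```

Step 2 is what forces a read to precede any write it would otherwise be unordered with, which is exactly what the second condition of definition 2.9 needs.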
We will now show that this $R_s$ satisfies the two conditions required of the total order $R_s$ in definition 2.9.
1. The first condition of definition 2.9 states that $(\bigcup_i R_i \cup R_s)$ is a partial order. This condition is satisfied because we started with $R^{0}_{s} = (R \cup R_{ww}(\Delta))^{+}$, which implies that $\bigcup_i R_i \subseteq R_s$.
2. The second condition of definition 2.9 states that
$\forall\, x.w(v_1)\, R_{wr}\, x.r{:}v_1,\ \nexists\, x.w(v_2) \mid x.w(v_1)\, R_s\, x.w(v_2)\, R_s\, (x.r{:}v_1)$.
Let us assume by way of contradiction that $\exists (a = x.w(v_1)), (b = x.r{:}v_1), (c = x.w(v_2)) \in \Delta$ s.t. $a R_s c R_s b$. Note that $a R_{wr} b$. There are three cases:
(a) $(c, b)$ was inserted in $R_s$ in step 1 of the construction above. Then $(c, a) \in R_{ww}$ (definition 2.7), which means that $(c, a) \in R_s$ (step 1). But this precludes $(a, c) \in R_s$, which contradicts our assumption that $(a, c) \in R_s$.
(b) $(c, b)$ was inserted in step 2. This is not possible because if $c$ and $b$ were not related before step 2 then it would be $(b, c)$ that would be added, not $(c, b)$.
(c) Before step 3, $b$ and $c$ would already be related, so $(c, b)$ cannot be inserted in step 3.
Theorem 2.1 From lemmas 2.2 and 2.4, CO = SC.
2.2.2 Non-Coherent Memories
In this section we discuss computations that violate our definition of Coherence (CO). We consider computations allowed by Causal [HA90] memory, pipelined random access memory (PRAM) [LS88], Slow and Weak memories [HA90] and Multiversion memory (MVM) [WW90]. Our correctness conditions are defined in terms of relations that we construct on a given computation. None of these relations require information that is not already available in a computation denoted by $E = \langle \Delta, R \rangle$.
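[Figure 2.3 (diagram): x.w and x.r events labelled $\alpha$, $\beta$, $\gamma$, $\delta$ on processes $i$ and $j$, with edges labelled $R_i$, $R_{wr}$ and $R^{+}$.]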
Figure 2.3: Execution that violates overwrite semantics of memory.  and  are not related
by potential causality but the value that  reads has been overwritten (when  read the
value written by ).
Causal Memory
In [HA90] Hutto and Ahamad define Causal memory to be a memory ". . . that requires all processors to agree on the order of causally related effects (writes) but allows events not related by potential causality (concurrent events) to be observed in differing orders."
The notion of potential causality is derived from Lamport's definition [Lam78] of the term for message passing systems. In [HA90] a write event is related to a message-send and a read event is related to a message-receive.
This interpretation has two problems. First, it fails to capture the overwrite semantics of memory and permits the computation shown in Figure 2.3, which should not be permitted. Second, the notion of an order of observation of events is not clear. A more precise definition of Causal memory is presented by Ahamad et al. in [AHJ90].
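To give a clear definition of a Causal computation we define two relation constructors $R_{w/w}$ and $R_{r/w}$, which take a set of events $\Gamma \subseteq \Delta$ and return ordered pairs from $\Gamma^{w}$.
Intuitively, $R_{w/w}(\Gamma)$ relates two distinct writes $\alpha$ and $\beta$ to the same location if and only if $\alpha$ follows $\beta$ in $R^{+}_{\Gamma}$ but the value written by $\beta$ is read by a read event that follows $\alpha$ (again in $R^{+}_{\Gamma}$). It captures the notion of a write event overwriting the value written by previous (in $R^{+}_{\Gamma}$) writes.
Definition 2.10 $R_{w/w}(\Gamma) = \{(\alpha = x.w(v_1),\ \beta = x.w(v_2)) \mid (\alpha, \beta \in \Gamma) \wedge \exists \gamma = x.r{:}v_2 \in \Gamma \mid \beta R^{+}_{\Gamma} \alpha R^{+}_{\Gamma} \gamma\}$
Note that $\beta R_{wr} \gamma$. It is illustrated in Figure 2.4.
[Figure 2.4 (diagram): two x.w events and one x.r event, labelled $\alpha$, $\beta$ and $\gamma$, with edges labelled $R^{+}$, $R_{wr}$ and $R_{w/w}$.]
Figure 2.4: $R_{w/w}(\Gamma)$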
Lemma 2.5 With $\alpha, \beta, \gamma$ as above, $(\alpha, \beta) \in R_{w/w}(\Gamma) \Rightarrow \alpha\, R_{ww}(\Gamma)\, \beta$.
This follows directly from the definition of $R_{ww}$. Note that the converse is not necessarily true.
$R_{r/w}$, illustrated in Figure 2.5, captures a different kind of overwrite semantics than $R_{w/w}$. While $R_{w/w}$ relates writes when one of them overwrites the other because of potential causality, $R_{r/w}$ relates a read with a preceding^5 write that is subsequently read.
Definition 2.11 $R_{r/w}(\Gamma) = \{(\alpha = x.r{:}v_1,\ \beta = x.w(v_2)) \mid (\alpha \in \Gamma, \beta \in \Gamma^{w}) \wedge \exists((\gamma = x.w(v_1) \in \Gamma^{w}) \wedge (\delta = x.r{:}v_2) \mid (\beta\, (R^{w}_{\Gamma})^{+}\, \alpha\, R^{+}_{\Gamma}\, \delta))\}$
Note that $\gamma R_{wr} \alpha$ and $\beta R_{wr} \delta$.
Lemma 2.6 With $\alpha, \beta, \gamma, \delta$ as above, $(\alpha, \beta) \in R_{r/w}(\Gamma) \Rightarrow (\beta\, R_{ww}(\Gamma)\, \gamma) \wedge (\gamma\, R_{ww}(\Gamma)\, \beta)$.
Lemma 2.7 $R_{w/w}$ and $R_{r/w}$ are monotonic.
Definition 2.12 $R_{Causal} = R^{+} \cup R_{r/w}(\Delta) \cup R_{w/w}(\Delta)$. A computation is Causal iff $\langle \Delta, R_{Causal} \rangle$ is a partial order.
In Figure 2.6 we give an example of a computation that is Causal but not Coherent (def. 2.8). It also shows the pairs of events that cause $\langle \Delta, (R \cup R_{ww})^{+} \rangle$ to not be a partial order. A memory is Causal iff it permits Causal computations only.
^5 Precede and subsequent refer to the order imposed by potential causality.
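[Figure 2.5 (diagram): two x.w events and two x.r events, labelled $\alpha$, $\beta$, $\gamma$ and $\delta$, with edges labelled $R^{+}$, $R_{wr}$ and $R_{r/w}$.]
Figure 2.5: $R_{r/w}(\Gamma)$
[Figure 2.6 (diagram): two x.w events and four x.r events, with edges labelled $R_i$, $R_{wr}$ and $R_{ww}$.]
Figure 2.6: A computation that is Causal but not Coherent.
[Figure 2.7 (diagram): events x.w, y.r, y.w, x.w and three x.r events on processes $i$, $j$ and $k$, two of them labelled $\alpha$ and $\beta$, with edges labelled $R_i$ and $R_{wr}$.]
Figure 2.7: A computation that is allowed by PRAM but is not Causal. $\alpha$ and $\beta$ are causally related but process $k$ observes them in an order that violates this causality.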
PRAM
Pipelined Random Access Memory (PRAM) is proposed in [LS88]. In a PRAM system each
process has a copy of every shared location. A process always reads the local copy. It writes
to the local copy and broadcasts the value written to all processes. The writing process
does not wait for the broadcast to complete. When a process receives the value broadcast
it updates its local copy.
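The operational description above can be modelled in a few lines. The sketch below is a toy, single-threaded model, not the [LS88] design or Mermera's implementation. The class and method names are hypothetical, and it assumes that the updates broadcast by one writer are applied at each receiver in the order in which they were written (one FIFO queue per sender and receiver pair), while updates from different writers may be applied in any interleaving.

```python
from collections import deque

class PramSystem:
    def __init__(self, names):
        # Every process holds its own copy of every shared location.
        self.local = {n: {} for n in names}
        # pending[(dst, src)]: updates broadcast by src, not yet applied at dst.
        self.pending = {(dst, src): deque()
                        for dst in names for src in names if dst != src}

    def write(self, writer, loc, val):
        self.local[writer][loc] = val            # update the local copy at once
        for (dst, src), queue in self.pending.items():
            if src == writer:
                queue.append((loc, val))         # broadcast; the writer does not wait

    def read(self, reader, loc):
        return self.local[reader].get(loc)       # reads never leave the local copy

    def deliver(self, dst, src):
        """Apply the oldest pending update from src at dst, if there is one."""
        queue = self.pending[(dst, src)]
        if queue:
            loc, val = queue.popleft()
            self.local[dst][loc] = val

# Two writers, two observers that receive the broadcasts in different orders.
system = PramSystem(["i", "j", "k", "l"])
system.write("i", "x", 1)
system.write("j", "x", 2)
system.deliver("k", "i"); system.deliver("k", "j")   # k applies 1, then 2
system.deliver("l", "j"); system.deliver("l", "i")   # l applies 2, then 1
```

After these deliveries process k ends with x = 2 and process l with x = 1: the two observers have seen writes by different processes in different orders, which PRAM permits but a coherent memory does not.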
To specify the correctness conditions for PRAM computations in terms of the orderings of events that are allowed, we use the following relations. Let $\Delta_{ij} = \Delta_i \cup \Delta_j$.
Definition 2.13 $R_{PRAM}(\Delta_{ij}) = (R^{w}_{\Gamma})^{+} \cup R_{w/w}(\Gamma) \cup R_{r/w}(\Delta_i)$, where $\Gamma = \Delta_{ij}$. A computation is a PRAM computation iff $\forall i, j \in P$, the set of all processes, $\langle \Delta^{w}_{ij}, R_{PRAM}(\Delta_{ij}) \rangle$ is a partial order.
A computation that is a PRAM computation but is not a Causal computation is shown in Figure 2.7. A memory is PRAM iff it permits only PRAM computations.
Slow Memory
Hutto and Ahamad [HA90] define Slow memory as follows: ". . . reads must return some value that has been previously written to the location being read. Once a value has been read, no earlier writes to that location (by the processor that wrote the value read earlier) can be returned. Local writes must be immediately visible. . . .".
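The middle clause of this description (once a value has been read, no earlier write to that location by the same writer can be returned) is easy to state as a check on what one process reads. The sketch below covers only that clause and rests on an assumption that is not part of [HA90]: each value read is tagged with a hypothetical sequence number giving the position of its write among that writer's writes to the location.

```python
def violates_slow_read_rule(reads):
    """reads: (location, writer, seq) triples seen by ONE reading process, in
    program order. Returns True if some read returns a write that is older than
    one already read from the same writer for the same location."""
    latest = {}
    for loc, writer, seq in reads:
        key = (loc, writer)
        if seq < latest.get(key, seq):
            return True                 # an earlier write was returned again
        latest[key] = seq
    return False

in_order = [("x", "j", 1), ("x", "j", 2), ("x", "j", 2)]
goes_back = [("x", "j", 2), ("x", "j", 1)]
# Different writers or different locations are not constrained by this clause.
unrelated = [("x", "j", 2), ("x", "k", 1), ("y", "j", 1)]
```

The last example shows why Slow memory is weaker than PRAM: the constraint is per writer and per location, so writes by one process to different locations may still be seen out of order.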
To describe the computations allowed by Slow memory we define $\Delta_{ijx}$, the set of events pertaining to location $x$ and belonging to process $i$ and process $j$, i.e.,
$\Delta_{ijx} = \{\alpha \in \Delta_{ij} \mid \alpha = x.o\}$.
Definition 2.14 $R_{Slow}(\Delta_{ijx}) = (R^{w}_{\Gamma})^{+} \cup R_{w/w}(\Gamma) \cup R_{r/w}(\Delta_{ix})$, where $\Gamma = \Delta_{ijx}$. A computation is a Slow computation iff $\forall i, j \in P \wedge \forall x \in L$, $\langle \Delta^{w}_{ijx}, R_{Slow}(\Delta_{ijx}) \rangle$ is a partial order.
A computation that is Slow but not PRAM is shown in Figure 2.8. A memory is Slow iff it allows Slow computations only.
[Figure 2.8 (diagram): events x.w, y.w, x.w, y.r and x.r on processes $i$ and $k$, with edges labelled $R_i$ and $R_{wr}$.]
Figure 2.8: A computation that is Slow but is not allowed by PRAM.
[Figure 2.9 (diagram): two x.w events and two x.r events on processes $i$ and $k$, with edges labelled $R_i$ and $R_{wr}$.]
Figure 2.9: An execution that is Locally Consistent but not Slow.
Locally Consistent Memory
In this section we define a new condition called Local Consistency (LC). This condition requires that each process observe events as if they were executed on a single processor. This is different from Sequential Consistency (SC, def 2.9) in that SC requires all events to occur as if they were executed on one processor, while LC requires that only events observed by a process appear to it as if they were executed on a single processor. Different processes may observe the same events in different orders.
Definition 2.15 $R_{LC}(\Delta_{ix}) = (R^{w}_{\Gamma})^{+} \cup R_{w/w}(\Gamma^{w}) \cup R_{r/w}(\Gamma)$, where $\Gamma = \Delta_{ix}$. A computation is Locally Consistent iff $\forall i, x$, $\langle \Delta^{w}_{ix}, R_{LC}(\Delta_{ix}) \rangle$ is a partial order.
A computation that is Locally Consistent but not Slow is shown in Figure 2.9. A memory is Locally Consistent iff it permits Locally Consistent computations only.
Weak Memory
The last type of computation we describe is allowed by Weak memory of [HA90]. It requires that a read return any value that is previously written by a write operation. The notion of previous is not defined.
Definition 2.16 A computation is Weak iff $\forall(\beta \in \Delta \mid \beta.o = \mathrm{read})\ \exists(\alpha \in \Delta) \mid \alpha R_{wr} \beta$.
This is the same condition as our well-formedness condition (def. 2.3). An execution that is Weak but not Locally Consistent is shown in Figure 2.10. A memory is Weak iff it permits Weak computations only.
[Figure 2.10 (diagram): two x.w events and two x.r events of process $i$, with edges labelled $R_i$ and $R_{wr}$.]
Figure 2.10: A computation that is Weak but not Locally Consistent.
Table 2.2: Summary of correctness conditions for different types of memories.
Memory Type    Partial Order Relation
Coherent       $\langle \Delta, (R \cup R_{ww}(\Delta))^{+} \rangle$
Causal         $\langle \Delta, (R^{+} \cup R_{w/w}(\Delta) \cup R_{r/w}(\Delta)) \rangle$
PRAM           $\forall i, j\ (\langle \Delta^{w}_{ij}, ((R^{w}_{\Delta_{ij}})^{+} \cup R_{w/w}(\Delta_{ij}) \cup R_{r/w}(\Delta_{i})) \rangle)$
Slow           $\forall i, j, x\ (\langle \Delta^{w}_{ijx}, ((R^{w}_{\Delta_{ijx}})^{+} \cup R_{w/w}(\Delta_{ijx}) \cup R_{r/w}(\Delta_{ix})) \rangle)$
LC             $\forall i, x\ (\langle \Delta^{w}_{ix}, ((R^{w}_{\Delta_{ix}})^{+} \cup R_{w/w}(\Delta^{w}_{ix}) \cup R_{r/w}(\Delta_{ix})) \rangle)$
2.2.3 Relating Memories
In this section we relate the various kinds of memory we have discussed so far in this chapter. For ease of reference we summarize the correctness conditions of the different types of memories in Table 2.2.
We establish parts of the hierarchy shown in Figure 2.11. In the figure, $A \subset B$ implies that the set of computations allowed by $A$ is a strict subset of the set of computations allowed by $B$.
Theorem 2.2 As shown in Figure 2.11,
$CO \subset Causal \subset PRAM \subset Slow \subset LC \subset Weak$
$CO$, $Causal$, $PRAM$, $Slow$, $LC$ and $Weak$ each represent the set of all computations allowed by the memory behavior they name.
Proof: From lemma 2.5 and lemma 2.6 we can conclude that if $R_{Causal}$ is not a partial order then $R_{Coherent}$ is not a partial order. Therefore any computation that is not Causal cannot be Coherent (i.e., a Coherent computation is Causal). We showed a computation that is Causal but not Coherent in Figure 2.6. Therefore, $CO \subset Causal$.
From the monotonicity property of $R_{w/w}$ and $R_{r/w}$ (lemma 2.7) it is obvious that if $R_{PRAM}(\Delta_{ij})$ is not a partial order for some $i, j$ then $R_{Causal}$ cannot be a partial order, because $\Delta_{ij} \subseteq \Delta$. Therefore any computation that is not a PRAM computation cannot be Causal (i.e., all Causal computations are PRAM computations too). In Figure 2.7 we showed a PRAM computation that is not Causal. Therefore, $Causal \subset PRAM$.
[Figure 2.11 (diagram): Weak $\supset$ Locally Consistent (LC) $\supset$ Slow $\supset$ Pipelined RAM (PRAM) $\supset$ Causal $\supset$ Coherent (CO) $=$ Sequentially Consistent (SC) $\supset$ Dynamic Atomic (DA); Coherent (CO) and the memories below it are marked Coherent, the memories above it Non-coherent.]
Figure 2.11: Hierarchy of Memories
Since $\Delta_{ijx} \subseteq \Delta_{ij}$ we can conclude from the same monotonicity property as above that if $R_{Slow}(\Delta_{ijx})$ is not a partial order then $R_{PRAM}(\Delta_{ij})$ is not a partial order. Therefore a computation that is not Slow is not a PRAM computation either. In other words, all PRAM computations are Slow computations. As shown in Figure 2.8, the converse is not true. Therefore $PRAM \subset Slow$.
The argument to show that a Slow computation must be Locally Consistent is slightly different. The first and third terms in the definition of $R_{LC}(\Delta_{ix})$ are subsets of the corresponding terms of $R_{Slow}(\Delta_{ijx})$. The second term in $R_{LC}(\Delta_{ix})$ is $R_{w/w}(\Delta^{w}_{ix})$ while the second term in $R_{Slow}(\Delta_{ijx})$ is $R_{w/w}(\Delta_{ijx})$. If there exists $(\alpha, \beta) \in R_{w/w}(\Delta^{w}_{ix})$ such that $(R^{w}_{\Delta_{ix}})^{+} \cup R_{w/w}(\Delta^{w}_{ix})$ is not a partial order (implying that the computation is not Locally Consistent) then there are two possibilities. We show that in both cases there will be a $j$ such that $(\Delta^{w}_{ijx}, R_{Slow}(\Delta_{ijx}))$ will not be a partial order.
1. $\alpha, \beta \in \Delta_{ix}$, in which case $(\alpha, \beta) \in R_{w/w}(\Delta_{ijx})$, or
2. $\alpha \in \Delta_{ix}, \beta \in (\Delta^{w}_{ix} - \Delta_{ix})$, in which case $\exists j \mid (R^{w}_{\Delta_{ijx}})^{+} \cup R_{w/w}(\Delta_{ijx})$ is not a partial order. The $j$ is the process to which $\beta$ belongs.
Therefore, a computation that is not Locally Consistent is not Slow either.
2.3 Asynchronous Iterations on Slow Memory
We will now use the formalism that we developed in the last section to show that under certain conditions Slow memory is sufficient for the correctness of totally asynchronous iterative algorithms summarized in [BT89, BT90]. In Section 3.3 we show how non-coherent memory can be used to solve a system of linear equations.
Consider an asynchronous iterative algorithm that finds the solution to $x \leftarrow f(x)$, where $x$ is a vector of length $n$. The $i$th component of $x$ is denoted by $x_i$. For simplicity we assume that we have $n$ processors and that the $i$th processor computes $x_i$.
Let $x_j(t)$ be the value of $x_j$ at time $t$, where $t$ is an integer variable used to index events of interest in the system. $T_i$ is the set of times at which $x_i$ is updated in the following manner.
\[ x_i(t+1) = \begin{cases} f_i(x_1(\tau^{i}_{1}(t)), x_2(\tau^{i}_{2}(t)), \ldots, x_n(\tau^{i}_{n}(t))) & \forall t \in T_i \\ x_i(t) & \text{otherwise} \end{cases} \]
where $0 \le \tau^{i}_{j}(t) \le t$. $\tau^{i}_{j}(t)$ is an integer that denotes the version of $x_j$ used by processor $i$ at time $t$ to compute $x_i(t+1)$.
It has been shown in [BT90] that if $f$ is a contraction mapping and the implementation of the algorithm to solve the problem satisfies the total asynchronism assumption given below then the iteration described above will converge. The total asynchronism assumption states that
1. the sets $T_i$ are infinite, and,
2. if $\{t_k\}$ is a sequence of elements of $T_i$ which tends to infinity then $\lim_{k \to \infty} \tau^{i}_{j}(t_k) = \infty$.
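The update rule can be simulated for the linear case $f(x) = Ax + b$. The following is a small sketch under stated assumptions, not the experiment reported later in the thesis: the matrix, the round-robin schedule, the delivery probability and all names are illustrative. Each processor keeps its own possibly stale view of the other components, a new value is delivered to each other processor only with some probability, and a delivered value always replaces an older one, so the version used never goes backwards.

```python
import random

def async_iterate(A, b, steps, seed=0, p_deliver=0.3):
    """Asynchronous iteration x <- A x + b with stale, lossy propagation."""
    n = len(b)
    rng = random.Random(seed)                 # fixed seed: reproducible schedule
    x = [0.0] * n                             # latest value written by each owner
    view = [[0.0] * n for _ in range(n)]      # view[i][j]: version of x_j seen by i
    for t in range(steps):
        i = t % n                             # processor i updates x_i at time t
        x[i] = sum(A[i][j] * view[i][j] for j in range(n)) + b[i]
        view[i][i] = x[i]                     # local writes are visible at once
        for k in range(n):
            # Best-effort propagation: an update may be lost, but a later
            # update from the same processor replaces it.
            if k != i and rng.random() < p_deliver:
                view[k][i] = x[i]
    return x

# Rows of |A| sum to 0.5, so f is a contraction in the maximum norm.
A = [[0.0, 0.3, 0.2],
     [0.1, 0.0, 0.4],
     [0.25, 0.25, 0.0]]
b = [1.0, 2.0, 3.0]
x = async_iterate(A, b, steps=6000)
residual = max(abs(x[i] - (sum(A[i][j] * x[j] for j in range(3)) + b[i]))
               for i in range(3))
```

Every processor updates infinitely often in the limit (condition 1), and every component is eventually refreshed at every other processor with probability one (condition 2), so the residual of the fixed-point equation shrinks towards zero even though no processor ever waits for another.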
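[Figure 2.12 (diagram): against a time axis, Process $j$ performs two writes $x_j.w$ at times $t_2$ and $t_1$, related by $R_{w/w}(\Delta_{ijx})$; Process $i$ performs two reads $x_j.r$ at times $t$ and $t+1$.]
Figure 2.12: A computation in which $\tau^{i}_{j}(t) = t_1 > \tau^{i}_{j}(t+1)$. In this figure $\tau^{i}_{j}(t) = t_1$ and $\tau^{i}_{j}(t+1) = t_2$.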
We will now show that under certain conditions the total asynchronism assumption is satisfied by a system using Slow memory to store $x$.
The first condition, that every set $T_i$ be infinite, must be satisfied by the program implementing this iteration. An infinite loop that repeatedly computes $f_i$ satisfies this condition^6.
To prove the second condition we first show that
$\tau^{i}_{j}(t+1) \ge \tau^{i}_{j}(t)$
Assume, by way of contradiction, that there exist $i$, $j$ and $t$ such that $\tau^{i}_{j}(t) > \tau^{i}_{j}(t+1)$. Let $\tau^{i}_{j}(t) = t_1$ and $\tau^{i}_{j}(t+1) = t_2$, $(t_1 > t_2)$. An execution that allows this will contain the computation segment shown in Figure 2.12. This is not permitted by Slow memory (def. 2.14) because the order imposed by $R_{w/w}(\Delta_{ijx})$ on the two writes by process $j$ to $x_j$ in Figure 2.12 creates a cycle in $R_{Slow}$. Contradiction. Thus $\tau^{i}_{j}(t)$ is a monotonically non-decreasing function. If an implementation of Slow memory guarantees progress, i.e., a write by a processor $i$ eventually reaches all other processors, or a subsequent^7 write to that variable by $i$ does, then $\tau^{i}_{j}(t)$ is guaranteed to increase. Since $x_j$ is written infinitely often (because of condition 1), $\lim_{t \to \infty} \tau^{i}_{j}(t) = \infty$.
A comprehensive list of fixed point problems that converge in a totally asynchronous iteration can be found in chapter 6 of [BT89]. As an example we will use this method to find the solution to a linear system of equations, $x = Ax + b$. Such an iteration will converge if the spectral radius^8 of $|A|$ satisfies $\rho(|A|) < 1$ [Bau78]. Application of this method to optimization problems, the shortest path problem, solution of differential equations and network flow problems is described in [BT89, BT90].
In the next chapter we motivate and propose a system that provides programmers with a choice of non-coherent behaviors that are related by the hierarchy established in theorem 2.2.
^6 We will deal with the termination of the iteration in Chapter 3.
^7 In $R_i$.
^8 The spectral radius $\rho(A)$ of a square matrix $A$ is defined as the maximum of the magnitudes of the eigenvalues of $A$.
Chapter 3
MERMERA: A System that
Combines Coherent and
Non-Coherent Memories
In the last chapter we introduced a formal model for describing different non-coherent behaviors and we established a hierarchy of non-coherent memories. In this chapter we propose our system, Mermera, which gives programmers a choice of coherent and non-coherent behaviors.
In Section 3.1 we argue for the inclusion of different types of behaviors in the programming model. Section 3.2 specifies Mermera, a system that provides a combination of the behaviors described in Chapter 2. Our formal model of Chapter 2 is extended in Section 3.2.2 to describe the behavior of Mermera. Finally, in Section 3.3, we show how the different behaviors provided by Mermera can be used in one program.
3.1 Algorithms that Tolerate Non-coherence
In the last chapter we established the hierarchy shown in Figure 3.1. In the figure, the subset relationship implies that the set of computations allowed by one type of memory is a proper subset of computations allowed by a memory higher in the hierarchy. The question that arises is: Which of these memories should be provided to programmers? Memories higher in the hierarchy have weaker ordering requirements, making efficient implementations possible. But not all applications can run on the weaker memories. In the rest of this section we list the applications known to run on each of the memories in the hierarchy.
Weak memory would be the most efficient to implement but we are not aware of any applications that can run on Weak memory alone. Multi-version Memory (MVM) [WW90] is a memory in which the programmer has access to Weak memory as well as Dynamic Atomic memory. MVM was used to implement B-trees.
In [HA90], Hutto and Ahamad show how a memory that handles exactly one operation at a time (i.e., serially) can be simulated by distributed processors using Slow memory. But we want to focus on classes of programs that can run directly on Slow memory so that one can exploit its potential performance advantages. We showed in Section 2.3 that Slow Memory is sufficient for the convergence of certain asynchronous iterative algorithms to find fixed points. But one would need a stronger behavior for detecting that the iteration has converged. We believe that asynchronous iterative graph algorithms (e.g., the Bellman-Ford shortest path algorithm) can also be implemented efficiently on Slow memory.
[Figure 3.1 (diagram): Weak $\supset$ Locally Consistent (LC) $\supset$ Slow $\supset$ Pipelined RAM (PRAM) $\supset$ Causal $\supset$ Coherent (CO) $=$ Sequentially Consistent (SC) $\supset$ Dynamic Atomic (DA); Coherent (CO) and the memories below it are marked Coherent, the memories above it Non-coherent.]
Figure 3.1: Hierarchy of Shared Memory.
Lipton and Sandberg show in [LS88] that PRAM can be used for a large number of applications like FFT, matrix-vector product, matrix-matrix product, dynamic programming and other computations that are in the large class of oblivious computations^1. They also prove that in these computations, whenever the actual computation dominates the synchronization^2 overhead, PRAM is much more efficient than Sequentially Consistent memory.
The use of Causal memory to solve the traveling salesman problem, the dictionary problem and to find the solution of a system of linear equations is described in [AHJ90].
Hutto and Ahamad showed in [HA90] that one cannot achieve mutual exclusion using Slow Memory. Attiya and Friedman [AF92] proved that any solution to the mutual exclusion problem using non-coherent memory will either involve a centralized server or will require the participation of all processes whether they want to enter the critical section or not.
This discussion illustrates that different problems are amenable to efficient solutions on different kinds of memory. For this reason we propose that programmers be given a choice of behaviors. We expect programmers to first program using coherent behavior because of their familiarity with it. They can then trade off the ease of programming for performance by using non-coherent behavior where the program can tolerate it.
^1 "A computation is oblivious if its data motion and the operations it executes at a given step are independent of the actual values of data." [LS88]
^2 Synchronization in the algorithm, not in the memory.
3.2 Specification of Mermera
In this section we briefly describe each of the memory behaviors supported by Mermera and explain how the behaviors are combined. Our system combines the behaviors of Coherent Memory, Pipelined RAM, Slow Memory and Locally Consistent Memory. This choice is rather arbitrary. Our test application could use all of these behaviors and these turned out to be easy to implement with the tools we had. Our model consists of processes sharing a region of their address space. These processes may be running on different processors. We first describe the behavior informally and then we extend our model of Chapter 2 to describe Mermera.
3.2.1 Informal Specification
Programs perform Read and Write operations to the shared memory. We provide four kinds of write operations: CO Write, PRAM Write, Slow Write and Local Write. Each of these operations takes a location and a value as arguments. Values are read from shared memory using the Read operation.
CO Write(loc, val): This operation provides the behavior specified by Coherent Memory (Section 2.2.1), i.e., all CO Writes are totally ordered.
PRAM Write(loc, val): This operation provides the behavior specified by PRAM (Section 2.2.2). The order of all PRAM Writes by the same process is respected by all processes, i.e., if a process performs two PRAM Writes, $w_1$ followed by $w_2$, then no process can read them in the reverse order. Writes by different processes may be interleaved in different orders by different processes.
Slow Write(loc, val): This operation provides Slow Memory. All Slow Writes by the same process to the same location are ordered by all processes in the order they were written. Slow Writes to different locations by the same process may be ordered differently by different processes.
Local Write(loc, val): This operation makes val visible only to the process executing the operation. It implements Local Consistency described in Section 2.2.2.
If a programmer uses write operations of only one kind then the programmer can assume that Mermera provides only that type of memory. If a programmer chooses to use more than one of the different kinds of write operations then the behavior is as follows. There is a global total order in which CO Writes are observed by the different processes, and this order is consistent with each process' program and with the information flow through weaker writes. All PRAM Writes and CO Writes by the same process are ordered by all processes in the order they were specified in the writing process' program. All Slow Writes, PRAM Writes and CO Writes satisfy the correctness condition of Slow memory described above. Similarly, all Local Writes, Slow Writes, PRAM Writes and CO Writes satisfy Local Consistency. This implies that if a process $i$ reads a value written by process $j$ then process $i$ must be aware of all previous (in $R_j$) writes by $j$ that are at least as strong as the write that is read. This condition does not hold for Slow Writes, as they may not be propagated to all processes.
Liveness Requirements
We note that the above specification of Mermera does not impose any liveness conditions on an implementation, i.e., it does not require that all writes be propagated to all processes. We now impose the condition that all CO Writes and PRAM Writes be eventually propagated^3 to all processes. We do not impose any such condition on Slow Writes. This is because losing any Slow Write does not constrain future writes of any type. On the other hand, losing a PRAM Write will cause all subsequent writes (PRAM Write and stronger) to be blocked, because receiving any of the subsequent writes would imply that the receiving process is aware of the write that was lost.
Performance Implications
Our specification requires a total order on all CO Write events. This means that a CO Write can return only after its position in the total order has been decided. Since any process could be doing a write at any time, this decision would require some kind of global consensus.
PRAM Writes have to be propagated eventually. This implies that messages used to do the propagation have to be reliable. But they can be buffered so that the cost of their propagation can be amortized over several writes. Moreover, this also allows for overlapping communication with computation.
Slow Writes do not require the messages to be reliable. The implementation can propagate them on a best-effort basis, giving no guarantees about their propagation. The implementation of Slow Write does not have to suffer any overhead for making the messages reliable. Slow Writes can also benefit from buffering just as PRAM Writes can.
Local Writes require no communication at all.
3.2.2 Formal Specification
In this section we extend the formalism described in Chapter 2 to describe the hybrid executions permitted by Mermera. This extension takes into account the fact that the Coherent behavior is affected by the information that may flow through weaker writes. In other words, applying the correctness conditions of Chapter 2 to projections of the different types of writes in $\Delta$ is not sufficient.
The write operations now have a type associated with them, and if $\alpha$ is a write operation then its type is referred to by $\alpha.type$. The type is a member of a totally ordered set $T = \{co, ca, p, s, lc\}$. The total order $\langle T, \preceq \rangle$ is reflexive and it imposes the order $co \preceq ca \preceq p \preceq s \preceq lc$. Where necessary, we show the type of a write by using a subscript. Thus the write operations are $w_{co}$, $w_{ca}$, $w_{p}$, $w_{s}$, $w_{lc}$ and they represent Coherent, Causal, PRAM, Slow and Locally Consistent writes respectively.
We modify the constructors R
ww
(Denition 2.7), R
w=w
(Denition 2.10) and R
r=w
(Denition 2.11) to use the type information. Each of them takes an additional parameter,
type 2 T .
Footnote 3: We will not dwell on the exact mechanism of propagation in this chapter. It
could be done using update messages or invalidate messages. We will discuss update-based
protocols in Chapters 4 and 5.
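Definition 3.1 $R_{ww}(\Sigma, type)$ is defined on $\Sigma^{w}$ as
$(\alpha = x.w_i(v_1)) \; R_{ww}(\Sigma) \; (\beta = x.w_j(v_2))$ iff
$\exists ((\gamma = x.r_k.v_2) \in \Sigma)$ such that
$(\alpha R^{+}_{\Sigma^{w}} \gamma) \wedge (\beta R_{wr} \gamma) \wedge (\alpha.type \geq type) \wedge (\beta.type \geq type)$.
Note that $R^{+}_{\Sigma^{w}}$ may contain pairs due to writes weaker than $type$.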
Denition 3.2 R
w=w
(; type) = f( = x:w(v
1
);  = x:w(v
2
))j(;  2 ) ^ (:type 
type) ^ (:type  type) ^ 9( = x:r:v
2
2 jR
+

R
+

)g
Denition 3.3 R
r=w
(; type) = f( = x:r:v
1
;  = x:w(v
2
))j( 2 ;  2 
w
) ^ ( 
type) ^ 9(( = x:w(v
1
)) 2 
w
^ ( = x:r:v
2
)j(:type type) ^ ((R
w

)
+
R
+

))g
Definition 3.4 A computation $E = \langle \Sigma, R \rangle$ is Coherent (CO) iff
$\langle \Sigma, (R \cup R_{ww}(\Sigma, co))^{+} \rangle$ is a partial order.
Definition 3.5 $R_{Causal} = R^{+} \cup R_{r/w}(\Sigma, ca) \cup R_{w/w}(\Sigma, ca)$. A
computation is Causal iff $\langle \Sigma, R_{Causal} \rangle$ is a partial order.
Definition 3.6 $R_{PRAM}(\Sigma_{ij}) = (R^{w}_{\Sigma})^{+} \cup R_{w/w}(\Sigma, p) \cup R_{r/w}(\Sigma_{i}, p)$,
where $\Sigma = \Sigma_{ij}$. A computation is a PRAM computation iff $\forall i, j \in P$,
the set of all processes, $\langle \Sigma^{w}_{ij}, R_{PRAM}(\Sigma_{ij}) \rangle$ is a
partial order.
Definition 3.7 $R_{Slow}(\Sigma_{ijx}) = (R^{w}_{\Sigma})^{+} \cup R_{w/w}(\Sigma, s) \cup R_{r/w}(\Sigma_{ix}, s)$,
where $\Sigma = \Sigma_{ijx}$. A computation is a Slow computation iff
$\forall i, j \in P, \forall x \in L$, $\langle \Sigma^{w}_{ijx}, R_{Slow}(\Sigma_{ijx}) \rangle$
is a partial order.
Definition 3.8 $R_{LC}(\Sigma_{ix}) = (R^{w}_{\Sigma})^{+} \cup R_{w/w}(\Sigma^{w}, lc) \cup R_{r/w}(\Sigma, lc)$,
where $\Sigma = \Sigma_{ix}$. A computation is Locally Consistent iff $\forall i, x$,
$\langle \Sigma^{w}_{ix}, R_{LC}(\Sigma_{ix}) \rangle$ is a partial order.
A hybrid execution is correct iff it is Coherent, Causal, PRAM, Slow and Locally
Consistent. Note again that this model does not have any liveness requirement.
3.3 Using Mermera
Having specified the behavior of Mermera, we now give two examples showing how it
can be used. We use all the behaviors of Mermera to solve a system of linear equations
using an asynchronous iterative method [BT89] [footnote 4]. We also show how the operations
of Mermera can be used to achieve barrier synchronization.
Footnote 4: The potential performance advantage of asynchronous iterative methods was first
established by Baudet in [Bau78].
Epsilon = 0.0001;                        /* Accuracy desired */
do
{ do
  { AbsoluteDiff = 0;
    for (i = MyLow; i < MyHigh; i++)
    { NewXi = $-(b_i + \sum_{j=1}^{i-1} a_{ij} x_j + \sum_{j=i+1}^{m} a_{ij} x_j) / a_{ii}$;
                                         /* Use Read to read x_j */
      AbsoluteDiff = AbsoluteDiff + abs(NewXi - Read(XStartLoc + i));
      Slow_Write(XStartLoc + i, NewXi);
    }
  } while (AbsoluteDiff >= Epsilon);     /* Local termination check: repeat until the
                                            change falls below Epsilon */
  for (i = MyLow; i < MyHigh; i++)
    PRAM_Write(XStartLoc + i, Read(XStartLoc + i));   /* Ensure that values propagate
                                                         to all processes */
  barrier();                             /* Wait for all processes to satisfy local
                                            termination */
} while (!global_termination());
Figure 3.2: A linear equation solver.
3.3.1 Solving a System of Equations
In this section we present a program that implements an asynchronous iterative algorithm
to solve a linear system of equations $Ax + b = 0$, using non-coherent memory on p processes.
x and b are vectors of size n and A is an $n \times n$ matrix. As mentioned earlier, the
asynchronous iterative method to solve this system will converge if $\rho(|A|) < 1$.
The program shown in Figure 3.2 is executed by each of the p participating processes.
Each process except the p-th process computes $\lfloor n/p \rfloor$ elements of x. Process p
computes $n - \lfloor n/p \rfloor (p - 1)$ elements. The inner loop is executed until a local
termination condition is satisfied. When a process reaches local termination it performs a
barrier synchronization with other processes. This ensures that each process satisfies its
local termination condition and that the values of x produced in the last iteration before
the local termination are propagated to all processes. Then each process does a global
termination check by running an iteration to compute all elements of x. If the check
succeeds, the program terminates with process 1 writing the solution to a file.
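The block partition described above can be sketched in C as follows. This is only an illustration: the helper names `block_bounds` and `block_size`, the 1-based process ids and the half-open index range are assumptions made here. The program in Figure 3.2 uses MyLow and MyHigh without showing how they are computed.

```c
#include <assert.h>

/* Hypothetical helper: computes the half-open range [*low, *high) of the
 * indices of x owned by process `id` (1-based), for n unknowns and p
 * processes. Processes 1..p-1 own floor(n/p) elements each; process p
 * owns the remaining n - floor(n/p) * (p - 1) elements. */
static void block_bounds(int n, int p, int id, int *low, int *high)
{
    assert(n >= 0 && p >= 1 && id >= 1 && id <= p);
    int chunk = n / p;                   /* floor(n/p) for non-negative ints */
    *low = (id - 1) * chunk;
    *high = (id == p) ? n : id * chunk;  /* last process takes the remainder */
}

/* Number of elements of x owned by process `id`. */
static int block_size(int n, int p, int id)
{
    int low, high;
    block_bounds(n, p, id, &low, &high);
    return high - low;
}
```

With n = 500 and p = 6, for instance, processes 1 to 5 would each own 83 elements and process 6 would own the remaining 500 - 83 * 5 = 85.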
Our specification of Slow memory permits Mermera to propagate Slow_Writes on a
best-effort basis, i.e., the system does not guarantee propagation of values written using
Slow_Write to all processes. It is for this reason that every process uses PRAM_Write after
every local termination. This satisfies the progress requirement mentioned in Section 2.3.
In this application, a process does $n - 1$ Read operations on shared memory for computing
each $x_i$ that it is supposed to compute. It then does one Slow_Write. Therefore it does
$n - 1$ Reads for each write operation. An implementation of Mermera that makes Read
operations cheap would be ideal for this application.
barrier()
{ while ((i < NumOfProcs) && (Read(i + 1) == false))
    ;                        /* wait until process i + 1 has set its flag */
  CO_Write(i, true);
  while ((i > 1) && (Read(i) == true))
    ;                        /* wait until process i - 1 has reset flag i */
  CO_Write(i + 1, false);
}
Figure 3.3: Pseudo C code for process i to implement barrier synchronization. Locations 1
through (NumOfProcs + 1) in shared memory are reserved for use in barrier synchronization.
3.3.2 Barrier Synchronization
Let there be n cooperating processes running concurrently. Each of them makes k calls to
the function barrier. The requirement is that their i-th call to barrier return only after
all of the n processes have made the i-th call to barrier. Our algorithm (Figure 3.3) uses
n + 1 shared flags, all initialized to false. The barrier is cleared after two phases in
which those flags are all set, and then reset. In the first phase, every process sets its
own flag, using a CO_Write, after waiting for all processes with a higher ID to set their
respective flags. Thus the flags are set in decreasing order of process IDs. Dually, in the
second phase, the flags are reset in increasing order of process IDs. This guarantees that
no process will clear the barrier until all others have reached it, and that no process will
break the barrier synchronization if it invokes barrier while some processes are still in it.
We cannot use Slow_Write to modify the flags because Slow_Writes are not guaranteed to be
applied at other processes. Loss of a write in this case can result in a process waiting
forever in one of the two while loops in the algorithm. We do not use PRAM_Write because it
may get buffered and propagated after some delay. In the meantime the task doing the barrier
cannot do anything. A process has nothing to gain by buffering the first of the two writes
in the algorithm. The second write can be a PRAM_Write, which will allow process i to
propagate the write later, but this will be at the expense of process (i + 1) having to wait
longer to cross the barrier.
In the next two chapters we describe the implementation of Mermera on a BBN Butterfly
TC2000 and on a network of SUN SPARCstation 1+ workstations. We will also
describe the performance of the linear solver described in Section 3.3.1.
Chapter 4
A Pilot Study on a BBN Butterfly
In the last chapter we established that Non-coherent memories could be used for certain
applications. We also argued that the implementation of Non-coherent memories would be
more efficient than that of Coherent memory.
In this chapter we describe a pilot implementation on a BBN Butterfly TC2000. The
purpose of this implementation is to get a feel for the improvement in access time that
one can get by making the memory non-coherent. Section 4.1 describes the algorithms to
implement a subset of the operations provided by Mermera. We implement CO_Write,
PRAM_Write and Read operations. Our algorithms favor Read operations by using a fully
replicated update-based protocol. In Section 4.2 we present performance measurements
from our implementation.
4.1 Implementation
Having described the behavior of Mermera, we now present a description of a pilot
implementation on a BBN Butterfly TC2000 using the Uniform System [Mat90]. We implement
PRAM_Write, CO_Write and Read operations. The purpose of this partial implementation,
which does not support hybrid computations, is to get a feel for the improvement in memory
response time that one may get by tolerating weaker coherence.
The TC2000 is a distributed memory machine, where a remote memory reference takes
about 2.1 microseconds, three times longer than a local memory reference. The particular
machine used has 45 processors, which run in a dedicated mode, i.e., one process per
processor.
We implement update-based protocols for CO_Write and PRAM_Write. Each process
has a copy of the shared locations. All writes are broadcast to all processes and reads return
the value in the local copy of the location being read.
We have two implementations of CO_Write. The first uses traditional two-phase locking
(2PL), and the second employs a pipelined locking protocol (PLP) that is much more
efficient than 2PL under certain loads. These protocols are given in Sections 4.1.1 and 4.1.2.
Section 4.1.3 describes an algorithm to implement PRAM. All these algorithms are
programmed as macros to avoid the penalty of an additional function call for each read or
write operation. We find that PRAM outperforms both the 2PL and PLP implementations
of coherent memory by at least an order of magnitude, even under loads that maximize
PLP's ability to pipeline memory requests efficiently.
4.1.1 Two phase locking (2PL)
We use a traditional 2PL protocol, in which a write obtains locks on all copies of a memory
location before it releases any of them, and a read operation locks and reads only the local
copy of the memory location. Read and write locks are not distinguished, since concurrent
reads never attempt to obtain the same lock. Thus we ensure the serializability of all
coherent read and write operations. Sequential consistency is guaranteed by additionally
respecting program ordering, by blocking a process that invokes a memory operation until
that operation is completed. To avoid deadlocks, we require all processes to obtain the locks
in the same fixed order, a technique often termed conservative 2PL.
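A minimal, portable sketch of this locking discipline is given below. It is not the Uniform System code of Figure 4.1: as an assumption made for illustration, it replaces UsLock and UsUnlock with C11 `atomic_flag` spinlocks, and the names `mem_init`, `co_write_2pl`, `read_2pl`, `NUM_COPIES` and `NUM_LOCS` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_COPIES 4   /* assumed number of replicas, one per process */
#define NUM_LOCS   8   /* assumed number of shared locations */

/* One lock and one value per copy per location, as in the 2PL scheme. */
struct loc_copy {
    atomic_flag lock;
    int value;
};

static struct loc_copy mem[NUM_COPIES][NUM_LOCS];

/* Puts every lock in the clear state and zeroes every value. */
static void mem_init(void)
{
    for (int i = 0; i < NUM_COPIES; i++)
        for (int j = 0; j < NUM_LOCS; j++) {
            atomic_flag_clear(&mem[i][j].lock);
            mem[i][j].value = 0;
        }
}

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;   /* busy-wait until the flag was previously clear */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Write: the growing phase locks every copy of `loc` in ascending copy
 * order (the fixed order avoids deadlock) and updates each copy once it
 * is locked; no lock is released until all copies have been locked. */
static void co_write_2pl(int loc, int val)
{
    assert(loc >= 0 && loc < NUM_LOCS);
    for (int i = 0; i < NUM_COPIES; i++) {
        spin_lock(&mem[i][loc].lock);
        mem[i][loc].value = val;
    }
    for (int i = 0; i < NUM_COPIES; i++)
        spin_unlock(&mem[i][loc].lock);
}

/* Read: lock, read and unlock only the caller's local copy. */
static int read_2pl(int self, int loc)
{
    assert(self >= 0 && self < NUM_COPIES && loc >= 0 && loc < NUM_LOCS);
    spin_lock(&mem[self][loc].lock);
    int v = mem[self][loc].value;
    spin_unlock(&mem[self][loc].lock);
    return v;
}
```

Because every writer acquires the per-location locks in the same ascending order, two concurrent writers to the same location cannot each hold a lock that the other needs.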
4.1.2 Pipelined locking protocol (PLP)
In this protocol we associate a single lock with the entire copy of the shared memory at
each process. When a copy of any shared location in a process has to be modified, the whole
copy of shared memory at that process is locked. Writes are propagated to copies in a fixed
order, with the lock being acquired before the value is written, and released only after the
next copy's lock is acquired. The advantages of this protocol are that shared memory reads
do not require any locking, and that write locks are held only for the duration of updating
one copy of the memory location. Thus, multiple writes to the same location can proceed
concurrently in a pipelined fashion, as long as they do not require access to the same copy
of memory at the same time. This makes PLP suitable for shared memories with many hot
spots. Its disadvantage is that concurrent writes to different locations are also pipelined.
The PLP algorithm is given in Figure 4.2, where UsLock and UsUnlock are atomic
operations provided by the Uniform System on the BBN Butterfly. MemCopyPtrs[i] contains
the lock for the copy of shared memory at process i and a pointer to the copy of the 0th
location at process i. A read operation does not need any locking; it simply returns the
value in the local copy of shared memory.
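The hand-over-hand locking described above can be sketched as follows. As an assumption made for portability, the sketch uses C11 `atomic_flag` spinlocks instead of UsLock and UsUnlock, and stores the values as `_Atomic int` so that the lock-free read is well defined in C11. All names (`plp_init`, `co_write_plp`, `plp_read`, `PLP_COPIES`, `PLP_LOCS`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define PLP_COPIES 4   /* assumed number of replicas, one per process */
#define PLP_LOCS   8   /* assumed number of shared locations */

static atomic_flag copy_lock[PLP_COPIES];            /* one lock per whole copy */
static _Atomic int copy_mem[PLP_COPIES][PLP_LOCS];   /* the replicated memory */

static void plp_init(void)
{
    for (int i = 0; i < PLP_COPIES; i++) {
        atomic_flag_clear(&copy_lock[i]);
        for (int j = 0; j < PLP_LOCS; j++)
            copy_mem[i][j] = 0;
    }
}

static void plp_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;   /* busy-wait until the flag was previously clear */
}

static void plp_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Write: visit the copies in a fixed order. The lock of copy i is
 * acquired before the lock of copy i-1 is released, so two writes can
 * never overtake each other, yet each lock is held only while one copy
 * is being updated. */
static void co_write_plp(int loc, int val)
{
    assert(loc >= 0 && loc < PLP_LOCS);
    plp_lock(&copy_lock[0]);
    for (int i = 1; i < PLP_COPIES; i++) {
        copy_mem[i - 1][loc] = val;      /* update the copy currently held */
        plp_lock(&copy_lock[i]);         /* take the next copy's lock ... */
        plp_unlock(&copy_lock[i - 1]);   /* ... before releasing this one */
    }
    copy_mem[PLP_COPIES - 1][loc] = val;
    plp_unlock(&copy_lock[PLP_COPIES - 1]);
}

/* Read: no locking, just return the value in the caller's local copy. */
static int plp_read(int self, int loc)
{
    assert(self >= 0 && self < PLP_COPIES && loc >= 0 && loc < PLP_LOCS);
    return copy_mem[self][loc];
}
```

Since writes cannot overtake one another along the chain of copies, every copy sees the same sequence of values, which is the property a CO_Write needs.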
4.1.3 PRAM algorithm
The algorithm for PRAM_Write, described in Figure 4.3, does not require any locking. The
condition that a process's writes be observed in the order they are done at the writer is
trivially satisfied because the writer blocks until the value is written in all processes'
copies of the shared memory.
4.2 Performance Results
In this section we report our performance measurements of this pilot implementation.
4.2.1 Access Time
We measure the performance of PRAM_Write, the two implementations of CO_Write (i.e.,
2PL Write and PLP Write) and the corresponding Read operations. We give the average
of the response times of all processes for a certain number of read/write operations to a
single location [footnote 1].

#define TimeOut 10      /* Number of microseconds to wait before retrying for a lock */

struct LocCopyStruct
{ short Lock;           /* Per copy per location */
  int Value; };

struct LocCopyStruct *MemCopyPtrs[MaxProcs];
int NumProcs;

CO_Write_2PL(int Loc, int Val)      /* Requires the use of Read_2PL */
{ register int i;
  register struct LocCopyStruct *LocCopyPtr;
  for (i = 0; i < NumProcs; i++)
  { LocCopyPtr = MemCopyPtrs[i] + Loc;
    UsLock(&(LocCopyPtr->Lock), TimeOut);
    LocCopyPtr->Value = Val; }
  for (i = 0; i < NumProcs; i++)
    UsUnlock(&((MemCopyPtrs[i] + Loc)->Lock));
}

Read_2PL(int Loc, int *Buffer)      /* Must use instead of Read with CO_Write_2PL */
{ register struct LocCopyStruct *LocCopyPtr;
  LocCopyPtr = MemCopyPtrs[UsProc_Node] + Loc;
  UsLock(&(LocCopyPtr->Lock), TimeOut);
  *Buffer = LocCopyPtr->Value;
  UsUnlock(&(LocCopyPtr->Lock));
}
Figure 4.1: The two phase locking protocol.

#define TimeOut 10      /* Microseconds to wait before retrying for a lock */

struct MemCopyPtrStruct
{ short *LockPtr;       /* A lock per memory copy */
  int *LocPtr; };       /* Pointer to copy of 0th variable in shared memory */

struct MemCopyPtrStruct MemCopyPtrs[MaxProcs];

CO_Write_PLP(int Loc, int Val)
{ register int i = 0;
  UsLock(MemCopyPtrs[i].LockPtr, TimeOut);
  for (i = 1; i < NumProcs; i++)
  { *(MemCopyPtrs[i-1].LocPtr + Loc) = Val;
    UsLock(MemCopyPtrs[i].LockPtr, TimeOut);
    UsUnlock(MemCopyPtrs[i-1].LockPtr);
  }
  *(MemCopyPtrs[i-1].LocPtr + Loc) = Val;
  UsUnlock(MemCopyPtrs[i-1].LockPtr);
}

Read(int Loc, int *Buffer)
{ *Buffer = *(MemCopyPtrs[UsProc_Node].LocPtr + Loc);
}
Figure 4.2: The Pipelined Locking Protocol.

struct MemCopyPtrStruct
{ int *LocPtr; };       /* Pointer to copy of 0th location in shared memory */

struct MemCopyPtrStruct MemCopyPtrs[MaxProcs];

PRAM_Write(int Loc, int Val)        /* Use Read from PLP. */
{ register int i;
  for (i = 0; i < NumProcs; i++)
    *(MemCopyPtrs[i].LocPtr + Loc) = Val;
}
Figure 4.3: The PRAM algorithm.
Experimental Methodology
Each process does the same operations on the location in shared memory. The total response
time seen by each process for a given load is added together and the sum is divided by the
number of processes to obtain the average response time for that load.
We report on measurements for three different sets of loads. The first set has 1000 Reads
and zero Writes, the second set has 500 Reads and 500 Writes, while the third set has zero
Reads and 1000 Writes. We vary the number of processes from 1 to 30, where each process
runs on a separate processor and each process imposes the load above. Every measurement
was repeated at least ten times, with negligible variance in the results.
Plots from our measurements are shown in Figures 4.4, 4.5, and 4.6. In Figure 4.4 we
show the performance of reads that require locking the local copy (i.e., Read_2PL) and of
reads that do not (i.e., Read defined in Figure 4.2). This gives us an estimate of the cost of
obtaining a local lock and releasing it (about 2.25 microseconds). Consequently, reads can
be considered of zero cost relative to writes.
Figures 4.5 and 4.6 show the plot of the average response time against the number of
processes for two sets of loads. These show the performance of the system under a load of
only Write operations and under a load in which 50% of the operations are Read operations
and 50% are Writes.
Discussion of Measurements
In all three methods the cost of writing dominates the cost of doing the corresponding read.
Our most important observation is that PRAM_Write is faster [footnote 2] than CO_Write by
a factor of about 40 in the case of the two phase locking implementation and by a factor of
about 10 in the pipelined locking implementation. These measurements indicate that
non-coherent memory possesses at least a ten-fold advantage over coherent memory in terms
of response time. This performance advantage is robust against significant optimization of
the Coherent memory implementation, and against a low latency differential between remote
and local communication. However, this implementation does not exploit the fact that
PRAM_Writes can be buffered and that computation and communication can be overlapped.
4.2.2 Solving a System of Equations
In this section we briefly present the performance of the equation solver described in
Section 3.3.1.
Footnote 1: This choice of load, while simplistic, favors coherent memory because it enables
PLP to exhibit its best performance.
Footnote 2: At 30 processors (processes), 500 PRAM Reads and 500 PRAM Writes took 48
milliseconds to finish, while 1000 writes took about 94 milliseconds to conclude.
[Plot: average response time for 1000 reads. Series: Read_2PL, Read w/o lock. x-axis: # of
processes (0 to 30); y-axis: average response time (in microseconds) x 10^3 (0 to 3.00).]
Figure 4.4: Performance of Read operations. All times are in microseconds.
[Plot: average response time for 500 reads and 500 writes. Series: CO_Write_2PL,
CO_Write_PLP, PRAM_Write. x-axis: # of processes (0 to 30); y-axis: average response time
(in microseconds) x 10^6 (0 to 2.20).]
Figure 4.5: Performance of Read and Write operations. All times are in microseconds.
[Plot: average response time for 0 reads and 1000 writes. Series: CO_Write_2PL, PLP_Write,
PRAM_Write. x-axis: # of processes (0 to 30); y-axis: average response time (in
microseconds) x 10^6 (0 to 4.50).]
Figure 4.6: Performance of Write operations. All times are in microseconds.
[Plot: time to solve 500 linear equations. Series: CO_Write, PRAM_Write. x-axis: number of
processes (5 to 25); y-axis: time in seconds (4 to 14).]
Figure 4.7: Performance of the solver on the TC2000.
Experimental Methodology
The program shown in Figure 3.2 was implemented. The Slow_Writes were replaced by
PRAM_Writes. This does not affect the correctness of the algorithm because computations
allowed by PRAM are also allowed by Slow Memory. The performance of this program was
compared with a program that used CO_Writes instead of PRAM_Writes. Between 1 and 25
processors were used to solve a system of equations in 500 variables and the time taken for
the iteration to converge was measured. The time taken for the convergence to be detected
was included in the measurements. Our measurements are summarized in Figure 4.7.
Discussion of Measurements
The program using PRAM_Writes performs up to 30% faster than the program using
CO_Writes. This modest improvement can be attributed to two factors. First, the two
programs differ only in the way they do their writes to the shared memory, but the number
of writes is a very small fraction [footnote 3] of the operations done in the iterations.
Therefore, optimizing them even by an order of magnitude improves the overall performance
by only a small amount. Second, PRAM_Writes allow writes to be buffered so that the
communication costs can be amortized over several writes, but our implementation does not
exploit this. An implementation that takes advantage of this freedom will be able to
amortize the cost of remote communication over several writes.
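The first factor can be illustrated with the standard Amdahl's-law estimate, which is not taken from the thesis: if writes account for a fraction f of the running time and become k times faster, the whole program speeds up by 1 / ((1 - f) + f / k). The function name and the sample values below are hypothetical and are not measurements from this study.

```c
#include <assert.h>

/* Amdahl-style estimate of the overall speedup obtained when only the
 * writes, which take `write_fraction` of the original running time,
 * become `write_speedup` times faster. */
static double overall_speedup(double write_fraction, double write_speedup)
{
    assert(write_fraction >= 0.0 && write_fraction <= 1.0);
    assert(write_speedup > 0.0);
    return 1.0 / ((1.0 - write_fraction) + write_fraction / write_speedup);
}
```

Under these assumptions, even a ten-fold faster write gives an overall speedup of only about 1.29 when writes take a quarter of the running time, since 1 / (0.75 + 0.025) is roughly 1.29.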
The performance of the solver is rather discouraging. It warns us that in the absence of
buffering, achieving even a 10-fold improvement in response time may not be sufficient to
get a significant improvement in the performance of certain applications. The read/write mix
of the application also plays a significant role in the performance. In the next chapter we
present an implementation of Mermera on a network of SUN SPARCstation 1+ workstations.
In that implementation we buffer writes and we achieve a much higher performance
improvement by using non-coherent memory.
Footnote 3: For each write to shared memory each process does 2n reads of local memory, n
additions and 1 division, where n is the number of variables in the system of equations
being solved.
Chapter 5
A Prototype on a Network of
Workstations
In this chapter we describe an implementation of Mermera on a network of SUN
SPARCstation 1+'s using the Isis toolkit. This implementation takes advantage of the fact
that non-coherent writes to the shared memory can be buffered. The performance of our
implementation is described in Section 5.2. Our experiment demonstrates the performance
improvement that can be achieved by certain applications by using non-coherent memory.
In Section 5.3 we discuss the impact of Isis on our implementation.
5.1 Isis Implementation
In this section we describe an implementation of Mermera on a network of SPARCstations
running Unix. We use version 2.2.5 of the Isis toolkit [BSS91, BJ87] for our implementation.
Our algorithms are update-based, i.e., each process has a copy of the shared memory and
this copy is updated as writes occur [footnote 1]. A Read returns the value in the local
copy. This is in keeping with our decision to make Read operations cheap. A write operation
to shared memory updates the local copy and propagates the value to other processes. How
this propagation is done depends on the type of the write operation.
The specification of Mermera does not require that all writes be propagated to other
processes. Only CO_Writes and PRAM_Writes are guaranteed to be propagated to all
processes. Slow_Writes can be propagated on a best-effort basis, i.e., the system tries to
propagate them but no guarantees are given [footnote 2]. The only guarantee is that the
correctness conditions will not be violated. According to the specification of Mermera,
Local_Writes are not propagated. Only the local copy is updated. The weaker ordering
constraints on non-coherent operations also allow us to buffer multiple writes and propagate
them together.
The writes are propagated by multicasting the values to other processes. We use the
Isis toolkit for our implementation because it gives us a suite of multicasts to groups of
processes which satisfy different ordering properties. The multicasts of interest to us are
abcast(), fbcast() and mbcast(). These primitives have different constraints on the order in
which the messages are delivered to their destinations. We summarize these differences in
the next few paragraphs.
Footnote 1: The implication of this full replication on the performance comparisons is
discussed in Section 5.2.
Footnote 2: This interpretation of Slow_Writes has not been implemented. The current
implementation guarantees propagation of Slow_Writes.
All messages sent using abcast are delivered in the same order at all destinations, i.e.,
there exists a total order on the order in which these messages are received by a process and
this order is the same for all processes. This is exactly the property we want for CO_Write.
Messages sent using fbcast obey a weaker constraint. Messages sent by the same process
are received by all processes in the order they were sent. However, fbcasts sent by different
processes may be interleaved in different orders at different processes. This suffices for
PRAM_Writes, so we use fbcasts to propagate them.
Messages sent using mbcast do not provide any ordering guarantees [footnote 3]. We use this
primitive for propagating Slow_Writes.
No ordering guarantees are provided among messages sent using different primitives, e.g.,
if two messages are sent one after the other using abcast and fbcast, respectively, they are
not guaranteed to be received in the order they were sent. We take this into consideration in
our implementation and ensure that the correctness conditions of Mermera are satisfied.
Our technique is explained later in this section.
Another feature of Isis that we use is its lightweight task system. This allows us to have
several concurrent tasks in a user process. These tasks are non-preemptive, which makes
synchronizing accesses to different data structures very easy, i.e., we do not have to worry
about enforcing mutual exclusion on accesses to data structures. A task controls when it
gives up the CPU to other tasks in the process. The delivery of a message to a process
causes a task to be created. Tasks are scheduled in FIFO order. Each message names an
entry point that specifies the task to invoke to handle the message.
We now explain how we combine the different operations [footnote 4]. These algorithms are
given in Figures 5.2, 5.3, 5.4 and 5.5. All writes other than CO_Writes may be buffered. The
structure of the buffer is shown in Figure 5.1. Its size can be dynamically adjusted
[footnote 5]. The (Location, Value) pair of a CO_Write is appended to the buffer and the
entire buffer is immediately broadcast using the abcast protocol to all processes (including
the writer). The task that issues the operation continues only after the message has been
processed by its process. This ensures that the message is processed in the global order of
abcasts. In case of PRAM_Writes and Slow_Writes the local copy is immediately updated and
the updates are appended to the buffer. This buffer is broadcast when it gets full or when a
CO_Write needs to be broadcast. Furthermore, it is also broadcast when a preset timeout
occurs. If the buffer is not full then the task issuing the PRAM_Write or Slow_Write can
continue after the local copy has been updated but before the buffer is sent. The broadcast
protocol used depends on the type of the strongest write in the buffer (according to the
hierarchy in Figure 2.11 with CO_Write being the strongest). The fbcasts and mbcasts are
sent asynchronously to all processes except the writer. The abcasts and fbcasts are reliable
broadcasts in the sense that the eventual delivery of messages sent using these broadcasts
is guaranteed. We use mbcasts, which happen to be reliable, for Slow_Writes but a
best-effort policy does not require this reliability. As discussed in Section 5.3 we may
switch to an unreliable broadcast protocol for propagating Slow_Writes.
Footnote 3: The current implementation of Isis does deliver them in the order they were
sent by the sender.
Footnote 4: In Section 5.3 we mention some optimizations that can be made to our
implementation.
Footnote 5: For the sake of simplicity we allow each write to write only floating point
numbers to shared memory. This can be modified to allow writes that modify an arbitrary
number of bytes in shared memory.

SenderId
SeqNo          Sequence #, consistent with the order in which buffers are sent by this process.
LastRelSeqNo   SeqNo of last reliable (abcast or fbcast) broadcast.
Type           Type of strongest write in this buffer.
Location       Value
...            ...
...            ...
Figure 5.1: Structure of buffer. This buffer is the message that is broadcast.
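The buffer of Figure 5.1 and the rule that the strongest buffered write selects the broadcast primitive can be sketched in C as below. This is an illustrative model only: it does not call Isis, the fixed capacity stands in for the dynamically adjustable size, and all type and function names (`struct buffer`, `buffer_reset`, `buffer_append`, `primitive_for`) are assumptions rather than names from the implementation.

```c
#include <assert.h>

#define BUF_CAPACITY 4   /* hypothetical fixed capacity for the sketch */

/* Write types ordered by strength, so that `>` means "stronger". */
enum write_type { WT_SLOW = 0, WT_PRAM = 1, WT_CO = 2 };

/* Stand-ins for the three Isis multicasts named in the text. */
enum primitive { PRIM_MBCAST, PRIM_FBCAST, PRIM_ABCAST };

struct update {
    int location;
    float value;             /* writes carry floating point values */
};

struct buffer {
    int sender_id;
    int seq_no;              /* order in which this sender sent its buffers */
    int last_rel_seq_no;     /* SeqNo of the sender's last reliable broadcast */
    enum write_type type;    /* type of the strongest write in the buffer */
    int count;               /* number of (location, value) pairs held */
    struct update updates[BUF_CAPACITY];
};

/* Empties the buffer; an empty buffer starts at the weakest type. */
static void buffer_reset(struct buffer *b)
{
    b->type = WT_SLOW;
    b->count = 0;
}

/* Appends one update and raises the buffer type if this write is
 * stronger than anything buffered so far. Returns 1 when the buffer
 * is full and should be broadcast, 0 otherwise. */
static int buffer_append(struct buffer *b, enum write_type t, int loc, float val)
{
    assert(b->count < BUF_CAPACITY);
    if (t > b->type)
        b->type = t;
    b->updates[b->count].location = loc;
    b->updates[b->count].value = val;
    b->count++;
    return b->count == BUF_CAPACITY;
}

/* Chooses the multicast from the strongest write in the buffer. */
static enum primitive primitive_for(const struct buffer *b)
{
    switch (b->type) {
    case WT_CO:   return PRIM_ABCAST;
    case WT_PRAM: return PRIM_FBCAST;
    default:      return PRIM_MBCAST;
    }
}
```

In this model a single PRAM_Write among buffered Slow_Writes is enough to send the whole buffer with the reliable, sender-ordered primitive.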
When an update message is delivered to a process, a task specified by the update_memory
function (Figure 5.4) is created. This task is responsible for applying the updates in the
message. If applying the update will violate any of the correctness conditions then that
buffer is put in a waiting list. If the buffer is a CO buffer then it is also enqueued in a
list of CO buffers. Whenever an update is applied, the waiting list is checked to see if
updates from any other buffer can be applied without violating the correctness conditions.
Roughly speaking, updates from a buffer are applied if all reliable messages [footnote 6]
from the sender of the buffer have been received (and their updates applied). In case of CO
buffers an additional check has to be made to ensure that updates from all previous (in the
total order of abcast messages) abcast messages have been applied. Slow buffers are forced
to wait only if the receiver is waiting for a reliable message from the same process. If a
Slow buffer arrives after updates from a subsequent (in the order of buffers sent by its
sender) buffer have been applied then its updates are ignored.
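The rules in the preceding paragraph can be condensed into one decision function. The sketch below is a simplified reading of them and not a transcription of Figure 5.4: the names are hypothetical, sequence numbers are assumed to start at 0 with -1 meaning "none yet", and the caller is assumed to know whether an earlier abcast is still waiting.

```c
#include <assert.h>

enum buf_type { BUF_SLOW, BUF_PRAM, BUF_CO };
enum action   { ACT_APPLY, ACT_WAIT, ACT_IGNORE };

/* What the receiver remembers about one sender. */
struct sender_state {
    int last_rel_applied;   /* SeqNo of last reliable buffer applied, -1 if none */
    int last_seq_applied;   /* highest SeqNo applied from this sender, -1 if none */
};

/* Decides what to do with an arriving buffer.
 *   seq_no            SeqNo carried by the buffer
 *   buf_last_rel      LastRelSeqNo carried by the buffer (-1 if none)
 *   earlier_co_waits  nonzero if an earlier abcast has not been applied */
static enum action classify(const struct sender_state *s, enum buf_type type,
                            int seq_no, int buf_last_rel, int earlier_co_waits)
{
    /* A reliable message from the same sender is still missing. */
    if (s->last_rel_applied < buf_last_rel)
        return ACT_WAIT;
    /* CO buffers must also respect the total order of abcasts. */
    if (type == BUF_CO && earlier_co_waits)
        return ACT_WAIT;
    /* A Slow buffer overtaken by a later buffer is dropped. */
    if (type == BUF_SLOW && seq_no <= s->last_seq_applied)
        return ACT_IGNORE;
    return ACT_APPLY;
}
```

Dropping an overtaken Slow buffer is safe under this reading because Slow_Writes carry no propagation guarantee.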
5.2 Performance Results
In this section we describe the performance of our implementation. We conduct two types
of experiments. First, we measure access time and completion time (defined in the following
subsection). Second, we measure the performance of the equation solver described in
Section 3.3.1 on our implementation. Two versions of the solver are used. The first version
uses all the behaviors of Mermera while the second version uses coherent behavior only.
5.2.1 Access Time and Completion Time
The access time of an operation is dened to be the duration of time between the invocation
and return of the operation. In case of non-coherent write operations this does not imply
that when the write returns, the value written has been propagated to all processes sharing
6
That is, messages sent using abcast or fbcast.
44
CO Write(int Loc, int Val)
f Append (Loc, Val) to Buer;
Buer.Type = CO;
broadcast(Buer);
Wait for update-received signal.
g
PRAM Write(int Loc, int Val)
f Update local copy of Loc.
Append (Loc, Val) to Buer;
if (Buer.Type == Slow)
Buer.Type = PRAM;
if (Buer is full)
broadcast(Buer);
g
Slow Write(int Loc, int Val)
f Update local copy of Loc.
if (Buer is empty) BuerType = Slow;
Append (Loc, Val) to Buer;
if (Buer is full)
broadcast(Buer);
g
Local Write(int Loc, int Val)
f Update local copy of Loc;g
Figure 5.2: Pseudo C code for write operations.
45
broadcast(Buer)
f static int ReliableSeqNo = 0; /? Sequence # of the last reliable broadcast?/.
static int SequenceNo = 0;
Buer.SeqNo = SequenceNo;
Buer.LastRelSeqNo = ReliableSeqNo;
switch (Buer.Type)
f case Slow:
Asynchronous-mbcast(Buer); /? To all but self ?/
break;
case PRAM:
Asynchronous-fbcast(Buer); /? To all but self ?/
ReliableSeqNo++;
break;
case CO:
Asynchronous-abcast(Buer); /? To all including self ?/
ReliableSeqNo++;
break;
g
SequenceNo++;
g
Figure 5.3: Pseudo C code for broadcast().
46
WaitList[];
/?WaitList[i] contains the RelSeqNo and the SeqNo of the last buer pro-
cessed from process i and a PendingBufList of buers that have been re-
ceived by this process but whose updates have not been applied at the local
copy ?/
AbcastQ; /?A queue of unprocessed Abcast buers ?/
update memory(CurrentBuer) /? Invoked when a update message is received ?/
BuerStruct *CurrentBuer;
f if ((WaitList[CurrentBuer.Sender].LastRelSeqNo < CurrentBuer.RelSeqNo) jj
((CurrentBuer.Type == CO) && (AbcastQ.Head != NULL)))
f insert in WaitList(CurrentBuer);
/? Inserts CO Buers in AbcastQ also ?=
return;g
while (CurrentBuer != NULL)
f Sender = CurrentBuer.Sender;
switch(CurrentBuer.Type)
f case CO:
WaitList[Sender].LastRelSeqNo = CurrentBuer.SeqNo;
if (CurrentBuer.SenderId == MyId)
f Apply only the CO Write which is the last update in the buer.
Send update received signal so that waiting task can continue. g
else Apply all updates to the local copy
break;
case PRAM:
Apply updates to local copy;
WaitList[Sender].LastRelSeqNo = CurrentBuer.SeqNo;
break;
case Slow:
if (CurrentBuer.SeqNo > WaitList[Sender].LastSeqNo)
Apply all updates to the local copy;
break; g
WaitList[Sender].LastSeqNo = CurrentBuer.SeqNo;
CurrentBuer = enabled Buer(Sender);g
g
Figure 5.4: Pseudo C Code for the handler for the update memory message. The function
enabled Buer(Sender) is described in Figure5.5. It returns a pointer to a buer whose
updates can be now be applied.
47
/* Checks if any of the pending buffers can be processed */
enabled_Buffer(Sender)
{ CurrentBuffer = NULL;
  PendingBuf = WaitList[Sender].PendingBufList.Head;
  if ((PendingBuf != NULL) &&
      (WaitList[Sender].LastRelSeqNo == PendingBuf->RelSeqNo))
  { if (PendingBuf->Type != CO)
    { Remove PendingBuf from PendingBufList;
      CurrentBuffer = PendingBuf; }
    else if (AbcastQ.Head == PendingBuf)  /* A CO buffer should be at the head
                                             of the AbcastQ to be processed */
    { Remove PendingBuf from PendingBufList and AbcastQ;
      CurrentBuffer = PendingBuf; } }
  if (CurrentBuffer == NULL)
  { PendingBuf = AbcastQ.Head;
    if ((PendingBuf != NULL) &&
        (WaitList[PendingBuf->Sender].PendingBufList.Head == PendingBuf))
    { CurrentBuffer = PendingBuf;
      Remove PendingBuf from PendingBufList and AbcastQ; }
  }
  return CurrentBuffer;
}
Figure 5.5: Pseudo C code for the enabled_Buffer() function.
memory. To measure the time taken for an operation to be invoked and for its propagation
to complete, we use a metric that we call completion time. This is the time taken for each
process to execute 100 writes in parallel and for all of these (p × 100) writes to be
propagated to all p participating processes.
Experimental Methodology
Our experiments are conducted on a dedicated network of 6 SPARCstations. The parameters
that are varied are: the number of processes sharing memory (from 1 to 6) and the size
of the buffer that is used to buffer the non-coherent writes (1, 10, 100, 1000 location-value
pairs). For each parameter setting we run the experiments over 100 times and we report
the fastest times measured.
The amount of work done for non-coherent writes may be different each time the operation
is invoked. This is because, if a write causes the buffer to become full, the propagation
of the buffer has to be initiated before the writing process can continue. So, we measure the
access time for 100 writes for each setting of the parameters. The access times observed by
the different processes may be different. This is especially true for CO Writes, which rely
on the total order of all abcasts. Isis achieves this total order by designating one of the
participating processes, p_0, to be the sequencer. A consequence of this is that CO Writes
by process p_0 have a much faster access time than those by other processes because they do
not need any remote communication to determine their position in the total order. The
average over the access times observed by each process is reported.

                      PRAM Write                      Slow Write
                      (buffer size)                   (buffer size)
Procs   CO Write      1       10      100     1000    1       10      100     1000
1        133.0       46.9     5.9     1.2     0.66   46.9     5.5     1.2     0.62
2        532.0      108.0    16.8     2.5     0.66   96.4    16.1     2.4     0.62
3       1005.0      101.3    26.6     3.9     0.66   80.6    25.9     3.9     0.62
4       2049.0      110.8    28.1     5.2     0.66   86.8    27.8     5.1     0.62
5       3561.0      117.8    27.7     6.3     0.67   86.4    27.2     6.3     0.63
6       5378.0      120.5    28.0     7.1     0.66   91.4    27.1     7.1     0.62

Table 5.1: Access time in milliseconds for 100 writes. Buffer size is measured in terms of
the number of write operations that can be buffered.
To measure the completion time, each process performs 100 writes between two barrier
synchronization calls. Again, the average of the times observed by each process is reported.
Tables 5.1 and 5.2 list our measurements.
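The relation between buffer size and the number of propagations described in this methodology can be sketched in C. This is a minimal illustration under stated assumptions, not the Mermera implementation: the names WriteBuffer, buffer_init, buffered_write, flush_buffer and messages_for_writes are hypothetical, the sketch assumes that a buffer is scheduled for broadcast exactly when it becomes full, and the broadcast itself is replaced by a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAIRS 1000   /* largest buffer size used in the experiments */

/* One buffered non-coherent write: a location-value pair. */
typedef struct {
    size_t location;
    double value;
} Pair;

/* Hypothetical per-process write buffer (not the BufferStruct of Figure 5.4). */
typedef struct {
    Pair   pairs[MAX_PAIRS];
    size_t capacity;   /* buffer size, in location-value pairs */
    size_t count;      /* pairs currently buffered */
    size_t flushes;    /* broadcasts initiated so far */
} WriteBuffer;

void buffer_init(WriteBuffer *b, size_t capacity)
{
    assert(capacity >= 1 && capacity <= MAX_PAIRS);
    b->capacity = capacity;
    b->count = 0;
    b->flushes = 0;
}

/* Stand-in for scheduling the buffer for broadcast: it only counts the message. */
void flush_buffer(WriteBuffer *b)
{
    if (b->count > 0) {
        b->flushes++;
        b->count = 0;
    }
}

/* Buffer one write and initiate propagation when the buffer becomes full. */
void buffered_write(WriteBuffer *b, size_t location, double value)
{
    b->pairs[b->count].location = location;
    b->pairs[b->count].value = value;
    b->count++;
    if (b->count == b->capacity)
        flush_buffer(b);
}

/* Number of full-buffer broadcasts caused by n writes with the given buffer size. */
size_t messages_for_writes(size_t n, size_t capacity)
{
    static WriteBuffer b;   /* static keeps the pair array off the stack */
    size_t i;
    buffer_init(&b, capacity);
    for (i = 0; i < n; i++)
        buffered_write(&b, i, 0.0);
    return b.flushes;
}
```

Under these assumptions, calling messages_for_writes(100, size) for the four buffer sizes used in the experiments shows how many times a writer has to initiate a propagation during its 100 writes; with a buffer of 1000 pairs the buffer never fills, so only the cost of buffering remains.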
Discussion of Measurements
For a buer size of 1000, 100 PRAM Writes take 0.66 milliseconds which shows that it takes
6:6s to execute the code that buers each write. The code to buer each Slow Write takes
6:2s.
The buer size has no signicant eect on the performance of CO Writes because no
buering is done for them. In case of PRAM Write and Slow Write the buer size is the
most important factor that aects the access time and the information ow time. This is
because the buer size determines the number of messages that are sent and received. The
access time for the non-coherent writes is almost independent of the number of processes
because the broadcast of their writes is done asynchronously. We do not know why this is
not true for a buer size of 100. When the buer lls up it is scheduled for broadcast and
the writing task can continue. On the other hand, a task doing a CO Write has to wait till
the message is delivered to the writing process before it can continue. The message will be
delivered only after its position in the total order of abcasts is determined. When the task
goes into a wait state messages sent by other processes can be received causing a delay in
the resumption of the waiting task. This explains the faster than linear (in the number of
processes) growth of the access time of CO Writes.
The completion time includes the time spent in doing the asynchronous broadcasts and
the time spent in executing tasks that are spawned as a result of incoming updates. The
number of messages sent and received is a significant determining factor for completion time.
This number depends on the buffer size. However, if we do at most 100 writes then all their
updates can be sent in a buffer of size 100. Therefore, increasing the buffer size beyond 100
does not have any effect on completion time. For small buffer sizes the writes generate a
large number of small messages. Isis can coalesce these small messages into larger messages
for more efficient transmission. This explains the observation that for a given number of
processors the completion time grows at a rate sub-linear in the number of messages sent.

                      PRAM Write                Slow Write
                      (buffer size)             (buffer size)
Procs   CO Write      1       10      100       1       10      100
1        136.8       50.4     9.4     4.8      46.8     9.0     4.7
2       1030.0      267.2    54.2    26.2     245.3    53.9    27.1
3       1574.0      412.9   100.0    52.4     383.4    99.7    53.9
4       2889.0      638.0   148.0    85.8     601.4   145.8    85.7
5       4645.0      963.8   188.4   124.9     871.9   187.9   125.4
6       6761.0     1204.0   247.9   170.9    1223.0   246.8   174.0

Table 5.2: Completion time in milliseconds for 100 writes.
The access time and completion time for CO Writes are close because of the FIFO scheduling
of tasks by Isis' lightweight task system. If a task is waiting for its process to receive
its broadcast then all tasks resulting from update messages that arrive in the meantime will
be executed before the waiting task continues. As a result of this, by the time a process
returns from its 100 writes it will have already processed many of the writes by other
processes.
Table 5.2 shows that for a fixed buffer size the completion time of non-coherent memories
depends linearly on the number of processes sharing memory. On the other hand, the
completion time of coherent writes grows faster than linearly as the number of processes
is increased. Completion times for CO Writes are 3 to 38 times longer than those for the
non-coherent writes. The factor increases as the number of processes increases.
One may argue that this comparison of the completion time of coherent writes and
non-coherent writes is unfair because few coherent memories are implemented using full
replication. More efficient implementations such as directory based schemes (see [CKA91]
for an example) exist. Our response to this argument is that there are some applications
(e.g., the linear solver of Section 3.3.1) which intrinsically generate a message traffic that
will be similar to the traffic generated by using full replication to implement shared memory.
The characteristic of such applications is that all participating processes read the
values written by all other processes regularly. Moreover, one may be able to devise more
efficient implementations of non-coherent memory by exploiting techniques developed for
cache consistency.
The performance of PRAM Write is comparable to that of Slow Write. We expect the
performance of Slow Write to improve significantly if the propagation is done on a best effort
basis rather than in a reliable manner as is done in the implementation described.
5.2.2 Solving a System of Equations
In this section we study the performance of the linear equation solver described in Chapter 3.
The performance of the solver that uses the non-coherent behaviors of Mermera is compared
with a program that uses the coherent behavior only. The effect of buffer size on the
performance of the solver is discussed.
Epsilon = 0.0001  /* Accuracy desired */
do
{ do
  { AbsoluteDiff = 0;
    for (i = MyLow; i < MyHigh; i++)
    { NewXi = -(b_i + \sum_{j=1}^{i-1} a_{ij} x_j + \sum_{j=i+1}^{m} a_{ij} x_j) / a_{ii};
                                         /* Use Read to read x_j */
      AbsoluteDiff = AbsoluteDiff + abs(NewXi - Read(XStartLoc + i));
      Slow_Write(XStartLoc + i, NewXi);
      isis_accept_events();
    }
  } while (AbsoluteDiff >= Epsilon);  /* Local termination check */
  for (i = MyLow; i < MyHigh; i++)
    PRAM_Write(XStartLoc + i, Read(XStartLoc + i));  /* Ensure that values
                                                        propagate to all processes */
  barrier();  /* Wait for all processes to satisfy local termination */
} while (!global_termination());
Figure 5.6: Solver with isis_accept_events().
Experimental Methodology
In Chapter 3 we presented a program to solve a linear system of equations, Ax + b = 0,
using the different operations offered by Mermera. The program is executed with a minor
modification shown in Figure 5.6. The modification is the addition of a call to the function
isis_accept_events, which allows Isis to deliver messages received from other processes.
The program solves a system of randomly generated⁷ equations in 1000 variables. The
matrix A is dense. The number of processes is varied from 1 to 6. Each process has a copy
of A and b. The vector x is in shared memory. The performance is measured in terms of the
time it takes for the iterations to converge and the convergence to be detected. This time
is called the convergence time and it does not include the time needed for the propagation
of A and b to all processes.
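For reference, the sequential counterpart of this iteration can be sketched in C. This is an illustrative sketch only: gauss_seidel_sweep and gauss_seidel_solve are hypothetical names, A is assumed to be stored row-major with non-zero diagonal elements, and ordinary array accesses take the place of the Read and Write operations of Mermera. The update follows from Ax + b = 0, that is, x_i = -(b_i + sum over j != i of a_ij x_j) / a_ii.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, so the sketch needs no math library. */
static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/* One Gauss-Seidel sweep for A x + b = 0, with A stored row-major (n x n).
   Returns the sum of absolute changes, which plays the role of AbsoluteDiff. */
double gauss_seidel_sweep(size_t n, const double *a, const double *b, double *x)
{
    double abs_diff = 0.0;
    size_t i, j;
    for (i = 0; i < n; i++) {
        double sum = b[i];
        double new_xi;
        for (j = 0; j < n; j++)
            if (j != i)
                sum += a[i * n + j] * x[j];   /* uses the newest x_j available */
        new_xi = -sum / a[i * n + i];
        abs_diff += abs_d(new_xi - x[i]);
        x[i] = new_xi;
    }
    return abs_diff;
}

/* Sweep until the change in one sweep drops below epsilon or max_sweeps is
   reached. Returns the number of sweeps performed. */
int gauss_seidel_solve(size_t n, const double *a, const double *b, double *x,
                       double epsilon, int max_sweeps)
{
    int sweeps = 0;
    assert(n > 0 && epsilon > 0.0);
    while (sweeps < max_sweeps) {
        sweeps++;
        if (gauss_seidel_sweep(n, a, b, x) < epsilon)
            break;
    }
    return sweeps;
}
```

In the parallel program each process would run the inner loop only over its own index range and obtain the other components of x from shared memory, so the values it reads may be older than those a sequential sweep would see.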
A number of buer sizes between 1 and 1000 are used to determine the eect of dierent
buer sizes on the performance of the solver. The performance of the program in Figure5.6
is compared with that of a program in which all Slow Writes are changed to CO Writes and
the PRAM Writes are deleted. The fastest convergence times from tens of runs for each
parameter setting are used to derive our conclusions. Because of the asynchronous method
the number of iterations done by each process was dierent and it varied from run to run.
⁷ To ensure ρ(|A|) < 1, all elements of |A| are less than 1 and, to ensure the numerical
stability of the algorithm, the diagonal elements of A are much larger than the other elements.
Procs   T_CO     T_mixed   T_CO / T_mixed
1       158.2    157.1      1.01
2       560.5     50.8     11.0
3       296.5     46.4      6.4
4       260.0     37.7      6.9
5       176.3     32.9      5.5
6       201.8     30.0      6.7

Table 5.3: Performance measurements for n = 1000. All times in seconds.
T_CO = time taken for the iteration to converge using coherent writes only.
T_mixed = time taken for convergence using the mixed behavior shown in Figure 5.6.
A purely sequential Gauss-Seidel iteration that does not use Mermera converged in 143
seconds after doing 42 iterations.
Discussion of Measurements
Our measurements are summarized in Table 5.3 and plotted in Figure 5.7. The fastest
convergence times, regardless of the buffer size, are used in these figures.
The main conclusions we derive from these measurements are:
1. Using non-coherent behavior instead of coherent behavior alone improves the performance
of our asynchronous iterative algorithm by a factor ranging from 5.5 to 11.0.
From the trend of our measurements we expect this improvement to increase as more
processors are added.
2. The program using coherent memory does not give any performance improvement
when we use more than one processor. Using multiple processes always takes longer
than using a sequential Gauss-Seidel iteration. On the other hand, using non-coherent
behavior gives us better performance.
3. The performance improvement of the program using coherent behavior breaks down
at 5 processors while the performance of the program using non-coherent behavior
continues to improve up to at least 6 processors.⁸
Eect of Buer Size: The specication of Mermera imposes no constraints on the
number of non-coherent writes that can be buered. The implementation is free to choose
this parameter. Our observations summarized in Figure5.8 show that the buersize has a
signicant eect on the performance of the solver.
For any number of processors the performance is the best when the buersize is 12-
14. This is because for smaller buers the frequency at which messages are sent is high
which imposes a high overhead of sending and receiving messages on the CPU. If the buer
is large then the frequency of messages is low but each process uses less recent values of
8
Because of limited equipment we could not increase the number of processors beyond 6.
52
Time to solve 1000 linear equations
CO
Mixed
Time (in seconds)
Num Of Procs
50.00
100.00
150.00
200.00
250.00
300.00
350.00
400.00
450.00
500.00
550.00
2.00 3.00 4.00 5.00 6.00
Figure 5.7: Convergence time vs. Number of processors.
53
Time to solve 1000 linear equations
p=2
p=3
p=4
p=5
p=6
Time (in seconds)
Buffer Size
25.00
30.00
35.00
40.00
45.00
50.00
55.00
60.00
65.00
70.00
75.00
80.00
85.00
90.00
0.00 20.00 40.00 60.00
Figure 5.8: Eect of Buersize on performance.
54
the components of x being computed by other processes. The case of 2 processors is an
exception where the best performance is achieved when the buersize is 500.
9
5.3 Impact of Isis
Our implementation on a network of workstations uses the Isis toolkit to propagate the
effect of writes of each process to all other processes. The overhead imposed by parts of the
code external to Isis is very small. The implementation of Isis is highly optimized for the
functionality it offers.
For our purposes the following features that are not currently offered by Isis would be
useful.
Unreliable, unordered multicasts: Our implementation of Slow Writes uses the mbcast
multicast primitive provided by Isis. Isis uses a transport layer that guarantees the
delivery of all mbcast messages in the order they were sent by the sender. Neither of these
properties (reliability and FIFO ordering of messages) is required by Slow Writes. We
expect Slow Write implemented using an unreliable, unordered multicast to perform
better than the current implementation.
Use of IP Multicast: The current version of Isis implements a multicast to a group by
sending a point-to-point message to each member of the group. This causes the CPU
overhead for sending a message to be paid for each member of a group. The Deering
multicast protocol [DC90] can use the multicast facilities provided by the hardware.
As a result, the CPU overhead for a multicast can be greatly reduced. An added
benefit is that the elapsed time for a multicast is reduced. All writes will benefit
from this enhancement and this will also result in a better utilization of the network
bandwidth because there will be fewer messages on the network for each write done
by any process. The Horus project [vRBC+92] addresses this issue.
Our approach in this implementation was to use Isis multicast primitives that approximately
satisfy the conditions required by the different kinds of write operations. The
messages received through the different primitives are processed by the receivers in an order
that does not violate the correctness conditions in the specification. Since the receivers
check that messages are processed in a certain order it is possible to use mbcast (or any
protocol that guarantees message delivery) for propagating PRAM Writes. The sequence
number information in the messages is sufficient to ensure the PRAM ordering. This
optimization should improve the performance of PRAM Writes closer to that of our current
implementation of Slow Writes, which uses reliable message delivery.
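How a receiver could enforce the PRAM ordering from sequence numbers alone can be sketched in C. This is a hypothetical illustration, not the handler of Figure 5.4: SenderState, sender_state_init, receive_pram and receive_slow are invented names, sequence numbers are assumed to start at 0 and increase by 1 per buffer of a sender, and a buffer is represented by its sequence number only. Buffers that arrive ahead of a missing predecessor are held back, and the Slow rule simply discards anything that is not newer than the last buffer applied.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16
#define MAX_APPLIED 64

/* Hypothetical receive state kept for one sender. */
typedef struct {
    unsigned next_seq;               /* sequence number expected next */
    unsigned pending[MAX_PENDING];   /* sequence numbers held back */
    size_t   n_pending;
    unsigned applied[MAX_APPLIED];   /* order of application, for inspection */
    size_t   n_applied;
} SenderState;

void sender_state_init(SenderState *s)
{
    s->next_seq = 0;
    s->n_pending = 0;
    s->n_applied = 0;
}

/* Stand-in for applying a buffer's updates to the local copy. */
static void apply_buffer(SenderState *s, unsigned seq)
{
    assert(s->n_applied < MAX_APPLIED);
    s->applied[s->n_applied++] = seq;
    s->next_seq = seq + 1;
}

/* Remove seq from the pending set if present; returns 1 when it was found. */
static int take_pending(SenderState *s, unsigned seq)
{
    size_t i;
    for (i = 0; i < s->n_pending; i++) {
        if (s->pending[i] == seq) {
            s->pending[i] = s->pending[--s->n_pending];
            return 1;
        }
    }
    return 0;
}

/* PRAM rule: apply buffers of one sender strictly in sequence order.
   Returns the number of buffers applied as a result of this arrival. */
int receive_pram(SenderState *s, unsigned seq)
{
    int applied = 0;
    size_t i;
    if (seq < s->next_seq)
        return 0;                    /* already applied: ignore the duplicate */
    if (seq > s->next_seq) {         /* a predecessor is missing: hold back */
        for (i = 0; i < s->n_pending; i++)
            if (s->pending[i] == seq)
                return 0;
        assert(s->n_pending < MAX_PENDING);
        s->pending[s->n_pending++] = seq;
        return 0;
    }
    apply_buffer(s, seq);
    applied = 1;
    while (take_pending(s, s->next_seq)) {   /* release buffers now enabled */
        apply_buffer(s, s->next_seq);
        applied++;
    }
    return applied;
}

/* Slow rule: apply a buffer only if it is newer than the last one applied. */
int receive_slow(unsigned *last_seq, unsigned seq)
{
    if (seq > *last_seq) {
        *last_seq = seq;
        return 1;
    }
    return 0;
}
```

With this kind of check at the receiver, the transport only has to deliver every buffer eventually; the order of arrival does not matter for correctness, only for how long a buffer waits in the pending set.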
Isis' implementation of abcast uses one of the processes as a sequencer. A consequence
of this is that coherent writes done by that process have a much faster access time than
coherent writes by other processes. This can lead to load imbalances, e.g., in the linear
solver the sequencer process went through several more iterations than other processes.
To understand this effect we ran the solver with the sequencer process not participating
in the pool of processors that perform the iterations. The performance of this version of
the program was inferior to that of the version in which the sequencer participated in the
iterations. The effect of the load imbalance caused by the fast access time of the sequencer
is not clearly understood.

⁹ For presentation reasons we have excluded that data point from Figure 5.8.
In this chapter we described the implementation and performance of Mermera on a
network of SPARCstations. The performance of a solver that uses the different behaviors
of Mermera was compared with that of one that used only its coherent behavior. This
comparison corroborates our belief that non-coherent memory can be a performance boon
for some applications.
Chapter 6
Conclusion
In Section 6.1 of this chapter we summarize the conclusions of our study and in Section 6.2
we speculate on the impact of technological advances on our results and explore directions
for future research.
6.1 Summary
In this thesis we presented Mermera, a model for parallel computing using non-coherent
shared memory (Figure 6.1). It offers programmers an opportunity to trade off programming
simplicity for performance. Programming with our system is more complex than using
Read and Write operations of coherent shared memory because of the existence of different
types of Write operations, but, as shown in our example of the linear solver, significant
improvements in performance can be achieved over using coherent shared memory. In
Chapter 2 we developed a formal model to describe the non-coherent behaviors proposed in
the literature. We used this model to prove that totally asynchronous iterative algorithms
converge on Slow Memory, a non-coherent memory proposed in [HA90]. As an example
of such an algorithm, we presented a program to solve a linear system of equations. We
extended our model in Chapter 3 to describe the hybrid behavior of Mermera.
[Diagram: processes P, ..., P_i, ..., P_j and the Non-coherent Shared Memory.]
Figure 6.1: Non-coherent shared memory model.

Chapters 4 and 5 described the implementation and performance of our model
on a BBN Butterfly TC2000 and on a network of SPARCstation 1+'s. We compared
the performance of the different types of writes of Mermera using the access time and
completion time measures. We also compared the performance of a version of the linear
solver that uses non-coherent operations of our model with that of a version that uses only
the coherent behavior. In our experiments on a network of SPARCstations, the former
outperformed the latter by a factor ranging from 5 to 11. These results were presented in
Section 5.2.2.
6.2 Future Work
This thesis showed how a totally asynchronous algorithm can achieve a significant improvement
in performance by using a hierarchy of non-coherent behaviors. Totally asynchronous
iterative algorithms for several applications have been proposed in chapter 6 of [BT89]. A
technique to easily convert these algorithms for Mermera would be very useful. In general,
we expect programmers to use coherent writes for the first implementation of an algorithm.
They can then identify the portions of the program that can tolerate non-coherence and get
an improved performance.
Of still greater importance is to find other applications that can benefit in performance
by employing non-coherent memory. One possible area for exploration is the use of
non-coherent memory in real-time systems. Real-time systems have the characteristic that
computations have deadlines. Sometimes, an imprecise result of a computation may be
preferred to a missed deadline. A notion of imprecise computations has been proposed
in [CLL90]. We know that non-coherent memory operations complete faster than coherent
operations but they offer weaker ordering guarantees. A computation may choose to use
non-coherent operations and provide imprecise results rather than use coherent operations
and miss its deadline.
The architecture community has attacked the problem of poor performance of coherent
shared memory by offering weaker consistency constraints, e.g., processor consistency
[Goo91], weak ordering [AH90], release consistency [LLG+90] and entry consistency [BZ91].
All of these, except processor consistency¹, require the programmer to distinguish between
synchronization accesses and ordinary accesses to memory. If the programmer makes this
distinction then the programs written for sequentially consistent memory will execute
correctly on these memories. Compiler techniques to automate this process are being studied.
Extending our model to incorporate this distinction between synchronization accesses and
ordinary accesses is another research issue. In the same vein, it would be interesting to
extend the formal model to allow shared objects with arbitrary operations (e.g., incr/decr,
enque/deque).
¹ Processor Consistency is similar to PRAM. In addition to the correctness condition of PRAM, it requires
that there be a total order on all writes to a location and each process' view of the memory be consistent
with this order.

In Chapter 5 we used access time and completion time as metrics to measure the
performance of our implementation. In the same chapter we presented the performance of a
linear equation solver. In terms of the metrics, the system performed the best for the largest
buffer size. On the other hand, our application performed the best when the buffer size was
12-14. We could not derive any correlation between the performance of our system on the
metrics above and the performance of our application. Research in the area of developing
performance metrics for non-coherent memories that can help us predict the performance
of applications on these memories can be very useful. Such a metric would allow a memory
designer to design memories that optimize these measures without having to be concerned
with individual applications. However, we are not sure about the existence of such a metric.
In the absence of such a metric, the programmer should have some control over the
parameters that affect the performance of the memory. Compiler techniques to adaptively set
these parameters can also be developed.
Our implementation uses full replication of the shared memory. Alternative implementations
can use partial replication. One such protocol using a directory-based scheme for
PRAM Writes is described in [LS88]. Enhancing such a protocol for a hierarchy of memory
behaviors is another area that deserves attention. But first we need to identify applications
that can benefit from such an implementation. Our example of the linear solver benefits
from full replication because the data written by any process is used by all other processes.
Advances in networking technology [ABC+89] have introduced the idea of having a network
co-processor at each workstation. This co-processor takes over the burden of sending
and receiving messages. Applications using non-coherent writes can benefit from such an
enhancement because then there can be true parallelism in the overlapping of communication
and computation. While a message (generated by a non-coherent write) is being
multicast by the co-processor the CPU can proceed with the computations of the thread
doing the non-coherent write. Applications using coherent writes cannot benefit as much
from this enhancement because a thread doing a coherent write can proceed only after the
CO Write's position in the total order of CO Writes is determined. In general, the time
taken for this will depend on the number of processes in the system.
In the future, we can expect the latency of networks that connect distributed systems
to decrease and their bandwidth to increase. But the decreases in latency will be small
relative to the increases in bandwidth. This is because a large part of the latency consists of
instructions in the operating system code at the nodes. How will this affect the performance
of non-coherent memory versus that of coherent memory? Will the difference disappear?
Our conjecture is that, as long as the time to communicate with a remote process is 3-4
orders of magnitude greater than the time to access local memory, some applications can
improve their performance by using non-coherent memory. Furthermore, the combination of a
high bandwidth network and a network coprocessor will make the choice of buffer size for the
implementation of non-coherent operations a moot issue. This is because writes will not
need to be buffered by the process doing the writes. The buffering will be done automatically
by the coprocessor, which will try to send the messages as fast as it can. It can change the
buffer size adaptively depending on the state of the network. The high bandwidth of the
network will be able to accommodate the large number of messages that may be generated.
The latency of the medium will still be a factor.
Multiprocessor desktop workstations are widely available now. The processors in these
systems share memory that is connected to a bus and the hardware protocols guarantee
coherence. A parallel computation on a network of such workstations can make use of the
fact that among processes on the same workstation the memory is coherent. To describe
such computations our model needs to be extended to allow for different semantics of an
operation among different groups of processes.
Bibliography
[ABC+89] E. A. Arnould, F. J. Bitz, E. C. Cooper, H. T. Kung, R. D. Sansom, and
P. A. Steenkiste. The design of Nectar: A network backplane for heterogeneous
multicomputers. In Proc. 3rd ACM Intl. Conference on Architectural Support
for Programming Languages and Operating Systems, Boston, Massachusetts,
April 1989.
[ABHN91] Mustaque Ahamad, James E. Burns, Phillip W. Hutto, and Gil Neiger. Causal
memory. In Proc. of the Fifth International Workshop on Distributed Algorithms,
pages 9-30, Oct. 1991.
[AF92] H. Attiya and R. Friedman. A correctness condition for high-performance
multiprocessors. Technical Report #719, Technion - Israel Institute of Technology,
Department of Computer Science, March 1992.
[AH90] Sarita V. Adve and Mark D. Hill. Weak ordering - a new definition. In Proc.
17th Annual International Symposium on Computer Architecture, May 28-31
1990.
[AHJ90] Mustaque Ahamad, Philip W. Hutto, and Ranjit John. Implementing and
programming causal distributed shared memory. Technical Report GIT-CC-90-49,
College of Computing, Georgia Institute of Technology, 1990.
[AHJ91] Mustaque Ahamad, Philip W. Hutto, and Ranjit John. Implementing and
programming causal distributed shared memory. In Proc. 11th IEEE Intl. Conference
on Distributed Computing Systems, June 1991.
[Bau78] Gerard M. Baudet. Asynchronous iterative methods for multiprocessors. J.
ACM, 25(2):226-244, April 1978.
[BBD+87] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson,
and R. Stevens. Portable Programs for Parallel Processors. Holt, Rinehart and
Winston, 1987.
[BDG+91] Adam Beguelin, Jack Dongarra, Al Geist, Robert Manchek, and Vaidy Sunderam.
A user's guide to PVM Parallel Virtual Machine. Technical Report
ORNL/TM-11826, Oak Ridge National Laboratory, July 1991.
[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency
Control and Recovery in Database Systems. Addison-Wesley, Reading, Mas-
sachusetts, 1987.
[Bir91] Kenneth P. Birman. The process group approach to reliable distributed computing.
Technical Report TR-91-1216, Computer Science Department, Cornell
University, July 1991. Revised January 1993. To appear in Communications of
the ACM.
[BJ87] K. Birman and T.A. Joseph. Exploiting virtual synchrony in distributed sys-
tems. In Proc. 11th ACM Symp. on Operating System Principles, Austin, Texas,
Nov. 1987. Also available as technical report TR 87-811 from the Dept. of Com-
puter Science, Cornell Univ., Feb. 1987.
[BKT92] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum. Orca - a language for parallel
programming on distributed systems. IEEE Trans. on Software Engineering,
18(3):190-205, March 1992.
[BR90] Roberto Bisiani and Mosur Ravishankar. Programming the PLUS distributed-
memory system. In Proceedings of the Fifth Distributed Memory Computing
Conference, IEEE Computer Society Press, Los Alamitos, California, April
8-12 1990.
[BSS91] Kenneth Birman, Andre Schiper, and Pat Stephenson. Lightweight causal and
atomic group multicast. ACM Trans. on Computer Systems, 9(3):272-314,
Aug. 1991.
[BT89] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation:
Numerical Methods. Prentice Hall, Englewood Cliffs, New Jersey,
1989.
[BT90] Dimitri P. Bertsekas and John N. Tsitsiklis. A survey of some aspects of parallel
and distributed iterative algorithms. Technical Report CICS-P-189, Center for
Intelligent Control Systems, Cambridge, January 1990.
[BZ91] Brian N. Bershad and Matthew J. Zekauskas. Midway: Shared memory parallel
programming with entry consistency for distributed memory multiprocessors.
Technical Report CMU-CS-91-170, School of Computer Science, Carnegie Mel-
lon University, September 1991.
[CBZ91] John Carter, John Bennett, and Willy Zwaenepoel. Implementation and performance
of Munin. In Proc. 13th ACM Symp. on Operating System Principles,
October 1991.
[CKA91] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS directories:
a scalable cache coherence scheme. In Proc. 4th ACM Intl. Conference on
Architectural Support for Programming Languages and Operating Systems, Santa
Clara, California, pages 224-234, Apr. 1991. Describes the Alewife machine.
[CLL90] Jen-Yao Chung, Jane Liu, and Kwei-Jay Lin. Scheduling periodic jobs that
allow imprecise results. IEEE Transactions on Computers, 19(9):1156-1173,
September 1990.
[DC90] S. E. Deering and D. R. Cheriton. Multicast routing in datagram internetworks
and extended LANs. ACM Trans. on Computer Systems, 8(2), May 1990.
[DSB86] Michel Dubois, Christoph Scheurich, and Faye Briggs. Memory access buffering
in multiprocessors. In Proc. 13th Annual International Symposium on Computer
Architecture, June 1986.
[Dun92] Trung Dung. Coherence implies sequential consistency. Personal Communica-
tion, October 1992.
[FP89] Brett D. Fleisch and Gerald J. Popek. Mirage: A coherent distributed shared
memory design. In Proc. 12th ACM Symp. on Operating System Principles,
Litchfield Park, Arizona, ACM Press, December 1989.
[GLL+90] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop
Gupta, and John Hennessy. Memory consistency and event ordering in scalable
shared-memory multiprocessors. In Proceedings of the 17th Annual International
Symposium on Computer Architecture, May 28-31 1990.
[Goo91] James R. Goodman. Cache consistency and sequential consistency. Techni-
cal Report 1006, Computer Sciences Department, University of Wisconsin-
Madison, February 1991. Originally appeared in 1989.
[HA90] P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to en-
hance concurrency in distributed shared memories. In Proc. 10th IEEE Intl.
Conference on Distributed Computing Systems, Paris, France, June 1990. Also
available as GATech technical report GIT-ICS-89/39.
[KCZ92] Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Lazy release consistency
for software distributed shared memory. In Proc. 19th Annual International
Symposium on Computer Architecture, Gold Coast, Australia, May 19-21 1992.
[Lam78] L. Lamport. Time, clocks and the ordering of events in a distributed system.
Communications of the ACM, 21(7):558-565, July 1978.
[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Transactions on Computers, c-28(9), September
1979.
[LH89] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems.
ACM Trans. on Computer Systems, 7(4):321-359, Nov. 1989.
[Li86] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis,
Yale University, Dept. of Computer Science, New Haven, CT,
October 1986.
[LLG+90] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and
John Hennessy. The directory-based cache coherence protocol for the DASH
multiprocessor. In Proc. 17th Annual International Symposium on Computer
Architecture, May 28-31 1990.
[LLG+92] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber,
Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam. Stanford
Dash Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
[LS88] R.J. Lipton and J.S. Sandberg. PRAM: a scalable shared memory. Technical
Report CS-TR-180-88, Princeton Univ., Dept. of Computer Science, Sep. 1988.
[Mat90] Mathematics and Computer Science Division, Argonne National Laboratory.
Using the BBN TC2000, June 1990. Available as ANL/MCS-TM-135.
[MF89] Ronald G. Minnich and David J. Farber. The Mether System: Distributed
Shared Memory for SunOS 4.0. In Proceeding of Usenix-Summer 89, 1989.
[MF90] Ronald G. Minnich and David J. Farber. Reducing host load, network load, and
latency in a distributed shared memory. In Proc. 10th IEEE Intl. Conference
on Distributed Computing Systems, Paris, France, June 1990.
[Min91] Ronald G. Minnich. Mether: A Memory System for Network Multiprocessors.
PhD thesis, University of Pennsylvania, 1991.
[Pap86] C. Papadimitriou. The Theory of Database Concurrency Control. Computer
Science Press, Rockville, Maryland, 1986.
[San90] Jonathan Sandberg. Design of the PRAM network. Technical Report CS-TR-
254-90, Computer Science Department, Princeton University, April 1990.
[Ser90] Dimitrios Serpanos. Scalable Shared Memory Interconnections. PhD thesis,
Princeton University, October 1990.
[vRBC+92] Robbert van Renesse, Ken Birman, Robert Cooper, Bradford Glade, and Patrick
Stephenson. Reliable multicast between microkernels. In Proceedings of the
USENIX workshop on Micro-Kernels and Other Kernel Architectures, Seattle,
Washington, pages 269-283, April 1992.
[WW90] W. E. Weihl and Paul Wang. Multi-version memory: Software cache man-
agement for concurrent B-trees. In Proc. IEEE Symposium on Parallel and
Distributed Systems, Dallas, Texas, Dec. 1990.
