Partition Consistency: A Case Study in Modeling Systems with Weak Memory
  Consistency and Proving Correctness of their Implementations by Cheng, Steven et al.
arXiv:1306.0077v1 [cs.DC] 1 Jun 2013
Partition Consistency:
A Case Study in Modeling Systems with Weak Memory
Consistency and Proving Correctness of their Implementations
Steven Cheng, Lisa Higham, Jalal Kawash
University of Calgary
stevechy@gmail.com,{higham,jkawash}@ucalgary.ca
Abstract
Multiprocess systems, including grid systems, multiprocessors and multicore computers, incorporate
a variety of specialized hardware and software mechanisms, which speed computation, but result in
complex memory behavior. As a consequence, the possible outcomes of a concurrent program can be
unexpected. A memory consistency model is a description of the behaviour of such a system. Abstract
memory consistency models aim to capture the concrete implementations and architectures. Therefore,
formal specification of the implementation or architecture is necessary, and proofs of correspondence
between the abstract and the concrete models are required.
This paper provides a case study of this process. We specify a new model, partition consistency,
that generalizes many existing consistency models. A concrete message-passing network model is also
specified. Implementations of partition consistency on this network model are then presented and proved
correct. A middle level of abstraction is utilized to facilitate the proofs. All three levels of abstraction are
specified using the same framework. The paper aims to illustrate a general methodology and techniques
for specifying memory consistency models and proving the correctness of their implementations.
Keywords: weak memory consistency models, distributed-shared memory, sequential consistency, pro-
cessor consistency, correctness of distributed implementations, partial-order broadcast.
1 Introduction
Multiprocess systems, including networks, grid systems, multiprocessors and multicore computers, incorporate
a variety of specialized hardware and software mechanisms such as replicated memory, multi-level
caches, write-buffers, multiple buses, and complex support for message passing. These features help speed
computation by hiding latency and avoiding bottlenecks when accessing shared memory. An unfortunate
consequence is that the possible outcomes of concurrent programs can be unexpected. Because the pro-
cesses’ views of the current state of memory at any moment do not completely agree, the execution of a
concurrent program may not be sequentially consistent.
Sequential consistency is important: it can be efficiently implemented in the absence of data races, it
supports platform-independent code, and it is easier to reason about than weaker models.
Nonetheless, real systems typically deviate from implementing sequential consistency, resulting in complex
memory behavior.
It is therefore necessary to formally and precisely specify what computations can arise on a given system.
Such a specification is a memory consistency model. A concrete or low-level memory consistency model
would describe possible outcomes in terms of the behavior of the actual system due to its hardware and
software architecture. For example, consider a system where each process has a write-buffer. The behaviour
of each store(x, ν) instruction could be separated into a sequence of low-level events: first, the value ν for
location x is recorded in the local write-buffer; later, the pair (x, ν) is removed from the write-buffer and the
shared memory location x is updated to contain ν. Similarly, the behaviour of each load(x) instruction could
be separated into the events: consult the write-buffer for a pending store to location x; if one exists, return
the associated value; otherwise fetch the value of location x from main memory and return that value. Such
an operational specification in terms of the events that occur in the system when an instruction is executed is a
good way to capture exactly what can arise when a concurrent program is executed. A programmer, however,
should be spared having to deal with the architectural details; she should be able to reason about her program
in terms of the instructions of the program, rather than low-level events. An abstract memory consistency
model aims to provide such a non-operational specification of the behaviors of concrete architectures and
systems. How can we be sure that such an abstract model is correct? What is required is a specification of
both the abstract model and the concrete architecture and a proof of their equivalence.
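The write-buffer example above can be made concrete with a small sketch. It is only an illustration of the operational description in the text, not an implementation from this paper: the class name WriteBufferProcess, the FIFO buffer discipline, the choice to return the newest pending store, and the initial value 0 are all assumptions introduced here.

```python
class WriteBufferProcess:
    """One process with a private write-buffer in front of shared memory.

    Assumptions (not from the paper): the buffer is FIFO, a load returns the
    newest pending store to the location, and unwritten locations hold 0.
    """

    def __init__(self, shared_memory):
        self.memory = shared_memory   # dict shared by all processes
        self.buffer = []              # pending (location, value) pairs, oldest first

    def store(self, x, v):
        # First low-level event of store(x, v): record the pair locally.
        self.buffer.append((x, v))

    def flush_one(self):
        # Second low-level event: remove the oldest pair and update shared memory.
        if self.buffer:
            x, v = self.buffer.pop(0)
            self.memory[x] = v

    def load(self, x):
        # Consult the write-buffer for a pending store to x.
        for loc, v in reversed(self.buffer):
            if loc == x:
                return v
        # Otherwise fetch the value from main memory.
        return self.memory.get(x, 0)


memory = {}
p, q = WriteBufferProcess(memory), WriteBufferProcess(memory)
p.store("x", 1)
q.store("y", 1)
# Both stores are still buffered, so each process misses the other's write.
print(p.load("y"), q.load("x"))
```

In this run both loads return the initial value, an outcome that no sequentially consistent execution of the two programs allows. This is the kind of behaviour a memory consistency model for such a system must admit.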
This paper provides a case study of this process. We introduce a new general class of memory consis-
tency models, called partition consistency, which captures different degrees of consistency between various
sets of shared variables, as is common in distributed and multiprocessor environments. Sequential consis-
tency [43], Goodman’s processor consistency [5], and the pipelined random-access machine [46] are all
instances of partition consistency. The definition of partition consistency is a natural extension of sequential
consistency. Implementations of partition consistency on the message-passing network model are presented
and proved correct. In contrast to the high-level partition consistency definition, the message passing net-
work is modelled at the level of message sends and receives combined with operations on local memory. It
requires several partial orders to capture the relationships between message operations and local memory
operations. Although partition consistency is an over-simplification, it is non-trivial and it captures some
essential properties of actual implementations.
Our implementations and their proofs of correctness are facilitated by a middle level of abstraction,
which generalizes a totally ordered broadcast model to a partially ordered one. Thus we introduce three lev-
els of abstraction: the abstract partition consistency model, the intermediate partial-order broadcast model,
and the concrete message-passing network model. The same framework is used to define the memory
consistency model of each level. So the intermediate partial-order broadcast model is first the target of
the implementation of the specified partition consistency model. Next, the same partial-order broadcast
abstraction serves as the specified model, which is implemented on the target message-passing network
model. We define the fast-read/slow-write and fast-write/slow-read implementations of partition consis-
tency on the partial-order broadcast model. Next we give two implementations of partial-order broadcast
(one token-based and one timestamp-based) on message-passing networks. This results in four compositions
of transformations; each implements partition consistency on message-passing networks.
Proofs of correctness of implementations of shared memory models on multiprocessors or networks
typically involve a great deal of tedious but essential detail, and are thus prone to imprecision and error.
Layering the implementation on levels that are defined using a common framework helps to overcome these
problems by allowing us to focus on only part of the proof obligation at each level. We also introduce
a diagrammatic notation that provides a visual representation of logical statements and inferences. The
precision and conciseness of mathematical logic aids in avoiding some potential ambiguities of consistency
proofs, while its representation in diagram format helps keep our intuitions aligned with the proof. This
notation helped us uncover errors in our initial implementations and our attempted proofs.
Partition consistency is of independent interest because it supports different consistency guarantees for
different partitions of variables. Itanium [39] and Java [49], for example, exhibit differing degrees of consis-
tency depending on how variables are declared. Similarly, abstract consistency models are typically defined
by partitioning operations into classes, where processes have different levels of agreement on the order of
operations in each class. For example, sequential consistency (SC) [43] requires complete agreement on
all operations; pipelined random-access machine (P-RAM) [46] requires agreement on the write operations
by any individual process; in addition to P-RAM, processor consistency (PC-G) [5] requires agreement on
the operations on the same variables. SC, P-RAM, and PC-G are all special cases of partition consistency.
This paper provides implementations of any instance of the partition consistency class on message-passing
networks with multi-threaded nodes.
Some of the authors have used earlier versions of this framework in previous research. For example,
it was used to expose problems with the Java specifications when applied to long-lived programs [34], to
provide a simple abstract definition for TSO [31], to study the Intel Itanium memory consistency model
[32], and to compare Itanium with the SPARC models [28].
Organization of the rest of the paper:
Section 2 discusses related research on modeling and proof techniques as applied to weak memory consis-
tency models. Section 3 provides the framework and the three levels of abstraction used in this paper, by
defining the memory consistency of each level. The setup for transformations and their proofs of correct-
ness is given in Section 4. The implementations of partition consistency on partial-order broadcast and their
proofs of correctness are given in Section 5. Section 6 (respectively, 7) provides the token (respectively,
timestamp) implementation of partial-order broadcast on the message-passing network model, and the proof
that the implementation is correct. Section 8 concludes by summarizing the paper and discussing future
directions.
2 Related Work
Research concerning systems with weak memory consistency proceeds in several directions. In the follow-
ing we look at those directions that are related to our work. We first review some formalisms used to specify
given systems, focusing on the ones that closely resemble our framework. There is a proliferation of abstract
memory consistency models aimed at capturing those relaxations of sequential consistency that are com-
mon in real systems. We concentrate the next part of our review on the subset of these that are instances of
partition consistency. Our work implements partition consistency on the message-passing network model,
so we next discuss its relationship to the large body of research concerned with how to make algorithms
that are correct for sequentially consistent machines remain correct when run on systems with weaker mem-
ory consistency guarantees. The principal goal of this paper is to demonstrate (through the case study of
partition consistency) one strategy for proving the correctness of an implementation of an abstract memory
consistency model on a concrete operational model. While an abstract memory consistency model is often
used to provide a non-operational model of a multiprocess system, a proof of the correctness of the proposed
model is less common. Our final review section discusses other instances of this activity, and related proof
techniques.
2.1 Formalisms for modelling systems
Several formalisms are used to specify concurrent systems; the most common types are based
on process algebras and automata theory. Process algebras start with a basic set of processes, then combine
them into larger systems using various algebraic operators. Communicating sequential processes (CSP) [37]
and the calculus of communicating systems (CCS) [50] are classic examples.
In the input-output automata (IOA) language [48], processes are specified as automata that communicate
by action synchronization. Actions are designated as input, output, or internal, where each input action is
jointly performed with all matching output actions. Actions in IOA are automata transitions that can arbi-
trarily modify local state. Their effect on local state is specified in an imperative language that is essentially
pseudocode. This can make IOA useful for reasoning about algorithms written in imperative programming
languages. The input/output matching is similar to CCS, and the fact that more than two processes can
participate in a communication action is similar to CSP.
The temporal logic of actions, TLA [45], specifies automata using a mathematical language. It avoids
programming languages, making it attractive to hardware designers. TLA also provides many tools for
reasoning about automata in general, and can be used with IOA [51].
Another approach is to regard a system as a collection of processes, where each process produces a
sequence of instruction invocations (program order). A memory consistency model is a set of partial order
constraints on the collection of all instructions of the system. (The model will specify what subsets of
program order must be maintained by which processes (their “views”), and to what degree processes’ views
must agree.) This is the approach used in this paper; it is defined in Section 3. It is most similar to that used
by Steinke and Nutt [59], who observed that all of the models that they were aware of from the literature
could be expressed in terms of the existence of serial views that extend certain partial orders.
2.2 Memory consistency models
There is a large body of literature examining various memory consistency models. Steinke and Nutt [59]
re-specified several models including Goodman’s processor consistency, sequential consistency, pipelined
random-access, and causal consistency in terms of processes’ views, and proved that their definitions are
equivalent to those in the literature. Then, they used their partial order definitions to compare the relative
strengths of models. Specifically, they arranged 12 models into a lattice, with SC being the strongest model.
This research demonstrated a definitional style that is general enough to capture the models in the literature
and to facilitate their comparisons. We now informally describe some specific memory consistency models.
It is straightforward to recast each using the partial order formalism of Steinke and Nutt as adapted in Section
3.
Sequential consistency is a strong memory consistency model introduced by Lamport [43]; it requires
agreement among all system processes on a single view of all the operations of all processes. This
view agrees with the order in which the instructions producing these operations appear in their programs,
called program order. An even stronger model, formulated as atomic objects by Lamport and by Lynch [47]
and as linearizability by Herlihy and Wing [27], requires agreement on the global timing of operations in
addition to sequential consistency.
Lipton and Sandberg [46] introduced a much weaker consistency model than sequential consistency, the
pipelined random-access machine (P-RAM) memory model. P-RAM is an example of distributed-shared
memory (DSM). It requires a process’s view to include its own operations and all other processes’ writes.
The view of a process must be consistent with program order. However, P-RAM allows processes to disagree
on the order of two writes performed by two different processes. As a result, P-RAM is so weak that it cannot
support a solution to mutual exclusion with only read/write variables [35].
Coherence requires that processes agree on the ordering of operations on each object separately, but
not on how the operations on different objects interleave. Coherence is also too weak to support mutual
exclusion with only read/write variables [33]. Coherence is a memory consistency model that captures a
property that we might expect from any multiprocessing system. Nevertheless, some language memory
models, such as Java, are incomparable to coherence.
Goodman’s version of processor consistency (PC-G) [25], as formalized by Ahamad et al. [5], strengthens
P-RAM by adding coherence to it. PC-G executions must simultaneously satisfy both P-RAM and
coherence. PC-G is weaker than SC, but it supports mutual exclusion with only read/write variables [35].
Hence, PC-G is one of the few weak models that can be used to implement SC with only read/write variables
[36]. Other versions of processor consistency exist, and these versions are incomparable [60].
In this paper, we introduce partition consistency, which defines a family of memory consistency models
inspired by PC-G, P-RAM, and SC. Each of these three models is a special case of partition consistency.
In addition to such abstract memory consistency models, the literature contains formalizations of the
memory consistency models implemented by concrete multiprocessor machines. Higham, Jackson, and
Kawash explore memory consistency models for SPARC [31] and Itanium [29, 30] multiprocessors, in-
cluding the TSO model, which is claimed to be the consistency model of Intel x86 multiprocessors. The
consistency model for Alpha processors is described in the Alpha manual [20] and is formally defined and
investigated by Attiya and Friedman [7]. The PowerPC consistency model is formalized by Corella, Stone,
and Barton [17]. Sarkar, Sewell, Alglave, Maranget and Williams [55] also aim at faithfully representing the
memory model of POWER multiprocessors. This research defines an abstract machine that implements the
model, and provides programmers with a high level explanation of how the memory model is implemented
by POWER multiprocessors. A large number of manually coded and automatically generated litmus tests are
run on various POWER processors to provide confidence that the hardware behaves as the memory model
predicts. The Intel architecture developer manual provides some description of an x86 memory model
[19], which is also clarified in an Intel whitepaper [18]. Owens, Sarkar, Sewell, Nardelli, Ridge, Braibant,
Myreen, and Alglave studied the memory consistency model of the x86 architecture [57, 56, 52].
It is important to study the behavior of highly used programming languages when there is more than one
thread of execution and to formalize the resulting memory models. The Java memory model was the subject
of a few studies (for instance, see [49], [6], and [34]). The C++ memory model was studied by Boehm
and Adve [12] and recently formalized by Batty, Owens, Sarkar, Sewell, and Weber [11]. The partition
consistency model is simpler than the models arising from such languages; our focus in this paper is on
proving the correctness of implementations of a model.
2.3 Implementations of sequential consistency on systems with weaker memory consistency
guarantees
Many researchers have addressed the question of how to ensure that a program that is correct under sequential
consistency remains correct and efficient under a weaker consistency model.
Attiya and Welch provide DSM implementations for sequential consistency and linearizability [8], and
they examine the difference in these implementations in terms of message delay. Building on Lipton and
Sandberg’s lower bound on sequential consistency [46], Attiya and Welch establish that it is more expen-
sive to implement linearizability than it is to implement SC. Cholvi, Fernandez, Jimenez, and Raynal [16]
improve the best-case performance given by Attiya and Welch by showing that sometimes a read can be
guaranteed not to incur any message delay. Cholvi et al. present a sequentially consistent DSM protocol
that ensures fast writes, but not all reads can be fast. Their implementation uses a single circulated token to
synchronize the processors’ copies of memory.
In this paper, we generalize Attiya and Welch’s total-order broadcast algorithm [8] to a partial-order
one, allowing us to implement a class of weak memory consistency models. Since our focus is weak
memory consistency, our generalization is based on their implementation for SC, rather than linearizability.
Brzezinski and Szychowiak [14] provide a DSM implementation for PC-G and prove its correctness. It
statically assigns a master node to each variable to ensure coherence. In contrast, our implementations use
a timestamp protocol or circulating tokens and are fully distributed.
Agrawal, Choy, Leong and Singh created the Maya DSM [4] to experiment with weak memory consis-
tency models. Amza, Cox, Dwarkadas, Keleher, Lu, Rajamony, Yu, and Zwaenepoel implemented the weak
memory consistency model release consistency in their Treadmarks DSM [40]. Adve introduced data-race-
free (DRF) programs [2, 3]. DRF programs have the property that there are no data races in any sequentially
consistent execution, which can be achieved by insertion of appropriate synchronization instructions. DRF
programs are guaranteed to yield sequentially consistent computations on several weak models including
release consistency [23, 22].
Shasha and Snir [58] implement sequential consistency on MIMD machines such as the NYU Ultracomputer
and IBM RP3. In such machines, a packet-switched network connects processors to multi-ported
memory modules that can be simultaneously accessed. Their goal is to gain efficiency by exploiting potential
simultaneous accesses to memory. To ensure that sequential consistency is not violated, control instructions
are added to delay accesses until the previous one by the same processor is completed, and synchronization
code (locks) is used to deal with cases when some memory accesses need to have stronger atomicity than
the word-level atomicity provided by the machine. For efficiency, it is important to use these constructs
only when necessary. Analysis of interdependence of processes is used to minimize their use. By doing this
analysis of a program before it is executed, they show that their implementation “requires far less locking
than database control theory would lead one to expect”. Their proofs are primarily set-theoretic in structure.
In this paper, we similarly implement sequential consistency (and other models) but on a message-passing
platform.
Kuperstein, Vechev and Yahav [41] developed an algorithm and its implementation (Fender) that infers
where memory fences are needed to maintain correctness. Fender takes as input a finite state program, a
safety specification and a memory model described by a transition system. For each state, Fender computes
an avoid formula that captures all the ways to prevent an execution from reaching the state. Once transitions
to invalid states are identified, provided they can be avoided by local fences, such fences can be inserted to
ensure that the invalid states are not reachable. This approach is distinctly different from the approach of this
paper. It uses an operational definition of the memory consistency model, and a state-based notion of safety,
whereas we use a partial-order definition of computations and a predicate on computations to define safety.
Our techniques, however, do not provide an automated way to infer where and what kind of synchronization
is required.
The CheckFence tool of Burckhardt, Alur, and Martin [15] is another tool to ensure that programs
remain correct when executed under weak memory models. It takes as input a program written in a subset
of C and an axiomatic memory model, and determines whether there is an execution that violates its specification.
CheckFence works by combining the axiomatic memory model definition with a compiled version of the input
program to form a Boolean satisfiability problem. It then calls a SAT solver to find violating executions
for a finite unrolling of the program. Like Fender, CheckFence has the advantage that much of the work of
checking correctness is automated. It does not, however, directly shed much light on how to fix a program
that does permit executions that violate the specification.
Alglave and Maranget [55] provide a class of memory models that can be instantiated to produce se-
quential consistency, Sparc-TSO and a model based on Power processors. They provide theorems on barrier
placement needed to regain sequential consistency and a tool called diy to automatically generate litmus
tests that detect relaxations of SC. They use a specification style that is similar to what we use in this paper.
Huseynov’s Distributed Shared Memory webpage [38] tracks the available academic and commercial
DSM implementations.
In contrast with all of these papers, this paper is concerned with developing techniques to prove that a
given (weak) abstract memory consistency model is a correct abstraction of a given architecture.
2.4 Proofs of correctness of concurrent systems
There are several methods to prove the correctness of programs. Hoare established a logic-based approach,
which proceeds by showing that a program satisfies specified post-conditions given that it satisfies specified
pre-conditions (see Backhouse [10]). Owicki and Gries extended Hoare’s logic to apply to multiprocessor
systems (see Feijen and van Gasteren [21]). Reynolds generalized Hoare’s logic to separation logic [54],
which facilitates proofs of programs because it allows reasoning about parts of memory independent of the
entire global state. Concurrent separation logic [13] combines these two extensions. It is aimed at reasoning
about concurrent programs and their shared mutable data structures.
CCS [50] and pi-calculus [1] establish correctness proofs by simulating one automaton with another.
Proofs proceed by showing that the properties of a simulation imply the specifications. In these approaches,
a system execution is a sequence of events. Lamport’s temporal logic of actions (TLA) [45] works similarly
except that an execution in TLA is a sequence of states. Lamport’s system executions framework [44]
uses a collection of events together with a happens-before partial order and a can-causally-affect relation,
where some general axioms for system executions are satisfied. A proof of correctness is constructed by
defining a mapping of sets of events to an abstract new event. This mapping induces a happens-before
order and a can-causally-affect relation on the set of new abstract events. Hence, a new abstract (high-level)
set of system executions is produced from a set of concrete (low-level) system executions. The proof
is completed by showing that these high-level executions satisfy the specification. These high-level and
low-level descriptions use the same mathematical language, allowing the abstraction process to be applied
repeatedly. Lamport’s system executions framework cannot be easily adapted to weak memory consistency
models. To more naturally capture weak memory consistency models, we typically use more than one partial
order in addition to agreement properties on these orders.
Lamport’s system executions framework does not specify how these executions are generated. Gischer
[24] and Pratt [53] address this problem. The execution of an entire program is described as a collection of
partially-ordered sets. These posets are generated by applying process algebra operations, such as sequential
and parallel compositions, to smaller programs. Thus, a set of posets that represent all system executions
can be recursively constructed. Then, simulation techniques are used to prove that the system executions
match the specifications.
Many proofs of shared-memory algorithms assume Herlihy and Wing’s strong consistency model, lin-
earizability [27]. A particularly useful and powerful property of linearizability is its locality: proving sep-
arately that the implementation of each object in the system is linearizable implies the correctness of the
whole system implementation. Usually weak memory consistency models do not have such a locality prop-
erty, complicating proofs of correctness.
Aspinall and Sevcik use a partial order formalism to represent the Java Memory Model (JMM) [6] in
order to prove that data-race-free programs produce sequentially consistent executions on the Java Virtual
Machine. The partial order constraints of JMM are used to produce a sequentially consistent total order.
Our partial order modeling shares similarity with theirs, but our low-level partial order constraints are used
to produce computations that satisfy the partial order constraints of (the high-level) partition consistency,
rather than a single total order.
3 Definitions and Models
An operation consists of an operation invocation and an operation response often involving shared objects.
We use completed operation to emphasize that an invocation is paired with a response. A thread generates
operation invocations in a sequence called program order. A process consists of a finite
collection of threads. A multiprogram is a finite collection of processes.
A computation of the multiprogram is formed by arbitrarily completing each operation invocation, in
each individual thread sequence, with a response, creating a collection of sequences of completed opera-
tions. Program order on operation invocations is naturally extended to define program order on the set of
completed operations of a computation. That is, a computation consists of a set of sequences of operations,
one sequence of operations for each thread of each process. We denote this unrestricted set of computations
of a multiprogram P by C (P ). The subset of C (P ) that could actually result from the execution of the mul-
tiprogram depends upon the distributed system’s architecture. A memory consistency model is a predicate
defined on the set of all possible computations of a multiprogram; it filters these computations to include
only those that could arise on the architecture being modeled. The subset of C (P ) that satisfies the memory
consistency predicate, MC, is denoted C(P,MC).
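The construction of C(P) and its filtering by a predicate can be illustrated with a small sketch. It rests on simplifying assumptions that are not part of the paper: every process is single-threaded, only READ invocations need a response value, responses are drawn from a finite set `values`, and the predicate `locally_valid` is a toy stand-in for a memory consistency predicate rather than one of the models studied here.

```python
from itertools import product

def computations(multiprogram, values):
    """Enumerate C(P): complete every READ invocation with an arbitrary value."""
    reads = [(p, i) for p, seq in multiprogram.items()
             for i, inv in enumerate(seq) if inv[0] == "READ"]
    for responses in product(values, repeat=len(reads)):
        returned = dict(zip(reads, responses))
        yield {p: [inv + (returned[(p, i)],) if inv[0] == "READ" else inv
                   for i, inv in enumerate(seq)]
               for p, seq in multiprogram.items()}

def filtered(multiprogram, values, mc):
    """C(P, MC): the computations of P that satisfy the predicate mc."""
    return [c for c in computations(multiprogram, values) if mc(c)]

def locally_valid(computation, initial=0):
    """Toy predicate (hypothetical): each process reads its own latest write."""
    for seq in computation.values():
        current = {}
        for kind, x, v in seq:
            if kind == "WRITE":
                current[x] = v
            elif current.get(x, initial) != v:
                return False
    return True

# One process that writes 1 to x and then reads x.
P = {"p": [("WRITE", "x", 1), ("READ", "x")]}
print(len(list(computations(P, (0, 1)))))   # size of C(P) over the value set {0, 1}
print(filtered(P, (0, 1), locally_valid))   # computations kept by the toy predicate
```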
We use the following notation, terminology and conventions for the rest of the paper. For a computation
C of a multiprogram P, $O_C$ denotes all the operations of C. A completed operation OPER with input u that
returns a value v is denoted $\mathrm{OPER}(u)v$. For a set of operations O, $O|\mathrm{wrts}(S)$ denotes the subset of all write
operations to variables in S; if S is all the variables, we write $O|\mathrm{wrts}$; $O|p$ denotes the subset of all operations
by process p ∈ P. The program order relation on $O_C$, denoted $\xrightarrow{\mathrm{prog}_C}$, is the partial order formed by the
union of the individual thread program orders (footnote 1). For all these notations we omit the subscript C when it is
obvious. When we need the individual program order for a particular process or thread p, we write $\xrightarrow{\mathrm{prog}_p}$.
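The style pred[args] is used to denote a predicate. Given relations $\xrightarrow{R}$, $\xrightarrow{T}$, and a set A, define extension and
agreement predicates on relations by:
• $\mathrm{Extends}[A, \xrightarrow{R}, \xrightarrow{T}] \stackrel{\mathrm{def}}{=} \forall a_1, a_2 \in A : a_1 \xrightarrow{T} a_2 \implies a_1 \xrightarrow{R} a_2$
• $\mathrm{Agree}[A, \xrightarrow{R}, \xrightarrow{T}] \stackrel{\mathrm{def}}{=} \forall a_1, a_2 \in A : (a_1 \xrightarrow{R} a_2) \iff (a_1 \xrightarrow{T} a_2)$.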
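The two predicates translate directly into code when a relation is represented as a set of ordered pairs. The following is a minimal sketch under that assumption; the function names `extends`, `agree`, and `order_of`, and the operation labels in the example, are illustrative and do not come from the paper.

```python
def extends(A, R, T):
    """Extends[A, R, T]: every pair of elements of A ordered by T is ordered by R."""
    return all((a1, a2) in R for (a1, a2) in T if a1 in A and a2 in A)

def agree(A, R, T):
    """Agree[A, R, T]: R and T order the elements of A in exactly the same way."""
    return all(((a1, a2) in R) == ((a1, a2) in T) for a1 in A for a2 in A)

def order_of(sequence):
    """The total order (as a set of pairs) realized by a sequence."""
    return {(a, b) for i, a in enumerate(sequence) for b in sequence[i + 1:]}

prog = {("w1", "r1")}                 # w1 precedes r1 in program order
L_p = order_of(["w1", "w2", "r1"])    # one process's view
L_q = order_of(["w2", "w1", "r1"])    # another process's view

print(extends({"w1", "w2", "r1"}, L_p, prog))   # L_p respects program order
print(agree({"w1", "w2"}, L_p, L_q))            # the views order w1, w2 differently
print(agree({"w1", "r1"}, L_p, L_q))            # but agree on w1, r1
```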
Given a total order on a finite set, there is only one sequence of all the elements of the set that realizes that
total order. Therefore, we sometimes overload the term total order for a finite set A: it refers to either the set
of ordered pairs $(A, \xrightarrow{T})$ in the order, or the sequence, which we denote by T, that realizes that total order.
The notation 〈 xa : a ∈ A 〉 specifies a collection of items xa, exactly one for each a ∈ A.
The most common shared objects for this paper are variables with the sequential specification [27]: a
sequence of READ and WRITE operations on a variable x is valid if each READ(x) returns the value written
by the most recent preceding WRITE(x, ·) in the sequence (or the initial value if no such WRITE exists). Other
shared objects will be defined later as needed. Any sequence of operations on a collection of objects is valid
if, for each object, the subsequence of operations on that object is valid.
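The validity condition for read/write variables can be checked mechanically. The sketch below assumes that a completed operation is encoded as a triple (kind, variable, value), where the value is the one written or returned, and that every variable initially holds 0; both conventions are assumptions made for this illustration.

```python
def is_valid(sequence, initial=0):
    """Check a sequence of completed READ/WRITE operations for validity.

    Tracking the latest write per variable is equivalent to checking the
    subsequence of operations on each variable separately.
    """
    current = {}
    for kind, x, v in sequence:
        if kind == "WRITE":
            current[x] = v
        elif current.get(x, initial) != v:
            return False   # a READ returned something other than the latest write
    return True

print(is_valid([("WRITE", "x", 1), ("READ", "x", 1), ("READ", "y", 0)]))
print(is_valid([("WRITE", "x", 1), ("WRITE", "x", 2), ("READ", "x", 1)]))
```

The first sequence is valid; the second is not, because the READ returns a value that has already been overwritten.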
The technical results of this paper concern three memory consistency models, called the partition con-
sistency model, the partial-order broadcast model, and the network model. We use these terms to describe
the abstract machine that delivers the consistency guarantees, but when we need to emphasize that these
models are actually predicates on computations, or when we need to denote them within other notation, we
use the abbreviated predicate forms, PC, POB, and NW respectively.
The partition consistency model defines a class of abstract memory consistency models that is designed
to capture processes that communicate by reading and writing shared variables. It requires each process to
“see” its own operations in addition to all other processes’ writes in a valid total order. This order must
extend program order. In addition, the views of all processes may be required to agree on the ordering of
some specified subsets of operations. More formally, let K = {V1, . . . , Vm} be a partition of a subset of the
set V of shared variables.
¹ Since p could be multithreaded, $(O_C|p, \xrightarrow{\mathrm{prog}})$ is not necessarily a total order.
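Definition 3.1. $\mathrm{PC}(K)[C] \overset{\mathrm{def}}{=} \exists \langle$ valid total order $(O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}) : p \in P \rangle$ satisfying
$(\forall p \in P : \mathrm{Extends}[O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}, \xrightarrow{\mathrm{prog}_C}])$ and $(\forall p, q \in P, i \in [1, m] : \mathrm{Agree}[O_C|\mathrm{wrts}(V_i), \xrightarrow{L_p}, \xrightarrow{L_q}])$.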
Different instantiations of K yield different memory consistency models including several well-known
models. For example, Sequential Consistency requires that all processes agree on a single valid total order
that extends program order. Thus, SC is PC({V }). In the pipelined random-access (P-RAM) model every process
“sees” all the writes of each other process in program order, but different processes can interleave these
sequences differently, so there is no additional agreement beyond program order on the write operations. Thus,
P-RAM is PC(∅). Goodman’s Processor Consistency requires that, in addition to P-RAM, for each shared
variable, processes agree on the order of all operations to that variable. Thus PC-G is PC({{v} | v ∈ V }).
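To make the effect of different choices of K concrete, the following brute-force sketch decides PC(K) for very small computations by enumerating candidate views. It is only an illustration under stated assumptions: each process is single-threaded, an operation is a tuple `(pid, index, kind, var, value)` with `index` giving its position in program order, initial values are 0, and the names `views` and `satisfies_pc` are hypothetical. The search is exponential and is not meant as a practical checker.

```python
from itertools import permutations, product


def valid(seq, init=0):
    """Each read must return the latest preceding write (or the initial value)."""
    mem = {}
    for (_pid, _idx, kind, var, val) in seq:
        if kind == "W":
            mem[var] = val
        elif mem.get(var, init) != val:
            return False
    return True


def respects_program_order(seq):
    """Operations of the same process must appear in increasing index order."""
    last = {}
    for (pid, idx, *_rest) in seq:
        if idx < last.get(pid, -1):
            return False
        last[pid] = idx
    return True


def views(ops, p):
    """All valid total orders of p's operations plus all writes that extend program order."""
    mine = [o for o in ops if o[0] == p or o[2] == "W"]
    return [s for s in permutations(mine)
            if respects_program_order(s) and valid(s)]


def satisfies_pc(ops, K):
    """True if some choice of one view per process agrees on writes to each class of K."""
    procs = sorted({o[0] for o in ops})

    def key(view):
        return tuple(tuple(o for o in view if o[2] == "W" and o[3] in Vi)
                     for Vi in K)

    for choice in product(*(views(ops, p) for p in procs)):
        if len({key(v) for v in choice}) == 1:
            return True
    return False


# Two writers to x; two readers observe the writes in opposite orders.
ops = [
    (0, 0, "W", "x", 1),
    (1, 0, "W", "x", 2),
    (2, 0, "R", "x", 1), (2, 1, "R", "x", 2),
    (3, 0, "R", "x", 2), (3, 1, "R", "x", 1),
]
print(satisfies_pc(ops, []))        # P-RAM, i.e. PC(∅)
print(satisfies_pc(ops, [{"x"}]))   # SC with V = {x}, i.e. PC({V})
```

In this example the two readers are forced to order the writes to x differently, so the computation is admitted when K is empty but rejected as soon as x belongs to a class of K.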
A variable is a single-writer variable if it can be written by only one process, otherwise it is a multi-
writer variable. The multi-writer variable subset of V is denoted V |multi-wrtrs. If {x} ∈ K and x is a
single-writer variable, then the Agree property for the set {x} holds automatically because write operations
on x are totally ordered by program order, and program order is preserved by the Extends property. Thus {x}
can be removed from K while maintaining PC(K). Because implementations spend resources to maintain
the consistency of each set in K , removing {x} from K could reduce partition maintenance overhead in an
implementation. This motivates two new natural instantiations of partition consistency,
• WeakPC-G def= PC(G) where the partition G is given by G = {{v} | v ∈ V |multi-wrtrs}, and
• WeakSC def= PC({V |multi-wrtrs}).
By the previous observation, WeakPC-G is equivalent to PC-G; however WeakSC is strictly weaker than SC
though still stronger than PC-G. Our preliminary investigation suggests that, for many programs, WeakSC
is equivalent to SC. Yet, in our implementation, it can be substantially more efficient than SC.
The message-passing network model captures a concrete reliable, message-passing asynchronous net-
work of multi-threaded processes. Each process has a set of locally shared variables, which threads within
that process use to communicate with each other. The accesses to locally shared variables are sequentially
consistent². Threads of distinct processes communicate by sending and receiving messages, where messages
from a sender to a receiver are received in the order sent.
This intuition of a network is formalized as follows. The shared objects are variables (shared between
threads of the same process) and messages (shared between different processes). Messages have distinct
identifiers, and support the send operation SEND(s, d, m) and the receive operation $\mathrm{RECV}()_{s,d,m}$, where s, d, m are the
source, destination, and message contents respectively. A sequence of message operations is valid if it
contains at most one SEND and at most one RECV of any message. We assume that for each local variable
of a process, each write to that variable is distinct. (If this is not the case, the process can add sequence
numbers to make them so.) Define the following relations on the set $O_C$ of operations of a computation C:
Message causality: (a message is received after it is sent)
$x \xrightarrow{\mathrm{MessageOrder}_C} y \overset{\mathrm{def}}{=} x, y \in O_C \wedge x = \mathrm{SEND}(s, d, m) \wedge y = \mathrm{RECV}()_{s,d,m}$
² Weakening this assumption of “local sequential consistency” is possible. It only requires some additional thread synchronization. Since this would add complication local to each process without otherwise changing the results of this paper, we do not include this option in the rest of this paper.
FIFO channel causality: (two messages sent in program order to the same receiver are received in that
order)
$x \xrightarrow{\mathrm{FifoChannel}_C} y \overset{\mathrm{def}}{=} x, y \in O_C \wedge x = \mathrm{RECV}()_{s,d,m} \wedge y = \mathrm{RECV}()_{s,d,m'} \wedge \mathrm{SEND}(s, d, m) \xrightarrow{\mathrm{prog}} \mathrm{SEND}(s, d, m')$
Writes-into causality for variables: (a value read from a variable must have been previously written to it)
$x \xrightarrow{\mathrm{WritesInto}_C} y \overset{\mathrm{def}}{=} x, y \in O_C \wedge x = \mathrm{WRITE}(w, z) \wedge y = \mathrm{READ}(w)_z$
Happens-before: (operations happen in an order that observes the message, FIFO channel, and writes-into
causalities)
$\xrightarrow{\mathrm{HappensBefore}_C} \overset{\mathrm{def}}{=} (\xrightarrow{\mathrm{prog}_C} \cup \xrightarrow{\mathrm{MessageOrder}_C} \cup \xrightarrow{\mathrm{FifoChannel}_C} \cup \xrightarrow{\mathrm{WritesInto}_C})^{+}$.
The definition of $\xrightarrow{\mathrm{HappensBefore}_C}$ is inspired by Lamport’s happens-before [42], but that definition considers
sequential processes that communicate only by message passing. This definition adds shared memory
communication between threads and is designed to incorporate weak consistency.
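Definition 3.2. $\mathrm{NW}[C] \overset{\mathrm{def}}{=} \exists \langle$ valid total order $(O_C|p, \xrightarrow{L_p}) : p \in P \rangle$ satisfying
$(\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{\mathrm{HappensBefore}_C}])$ and $(\mathrm{RECV}()_{s,d,m} \in O_C$ if and only if $\mathrm{SEND}(s, d, m) \in O_C)$.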
This definition captures what we would expect of a reliable message-passing network that has FIFO
channels between each pair of processors. It requires that each process’s view of its own operations is
consistent with $\xrightarrow{\mathrm{HappensBefore}_C}$ order. Thus, if two operations by threads of process p are causally ordered, and
even if the intermediate operations that cause that ordering are not visible to p, there must be a valid view of
all p’s operations that does not conflict with that causal ordering. (Without this property, the model could,
for example, allow a computation where process p receives m1 from q, then does some local computation,
then sends m2 to q, and process q receives m2 from p, then does some local computation, then sends m1
to p. Since such a computation could not occur on a network, it should fail to satisfy the NW memory
consistency predicate.) The last conjunct ensures that the received messages are exactly those that are sent.
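The parenthetical example can be replayed mechanically. The sketch below computes the happens-before relation as a transitive closure over relations given as sets of pairs and shows that the example orders p's two operations in both directions, so no total order on p's operations can extend it. The operation names (`recv_m1`, `send_m2`, and so on) and the function names are assumptions made for this illustration.

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing the given set of pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def happens_before(prog, message_order, fifo_channel, writes_into):
    """(prog ∪ MessageOrder ∪ FifoChannel ∪ WritesInto)+ as a set of pairs."""
    return transitive_closure(prog | message_order | fifo_channel | writes_into)


def is_acyclic(relation):
    """A transitively closed relation is acyclic iff it has no pair (a, a)."""
    return all(a != b for (a, b) in relation)


# p: receives m1, then sends m2.  q: receives m2, then sends m1.
prog = {("recv_m1", "send_m2"), ("recv_m2", "send_m1")}
message_order = {("send_m1", "recv_m1"), ("send_m2", "recv_m2")}
hb = happens_before(prog, message_order, set(), set())
print(is_acyclic(hb))
```

Because both (recv_m1, send_m2) and (send_m2, recv_m1) end up in the closure, the Extends requirement of Definition 3.2 cannot be met for process p, which is why such a computation fails the NW predicate.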
Any instance of partition consistency could be constructed directly on the message-passing network
model. We obtain cleaner proofs and better abstraction, however, by introducing an intermediate level that
isolates the fact that processes broadcast write updates and apply them locally without the details of how
broadcasting is managed.
The partial-order broadcast model is designed to capture a collection of multithreaded processes, where
threads within each process communicate through shared variables, and updates are communicated between
distinct processes using a one-to-all BCAST and a corresponding DELIVER. Every process delivers updates
in an order that extends the program order of the corresponding broadcasts. Furthermore, updates can be
labeled. Processes agree on the delivery order of all updates with the same label; such agreement is not
required for differently labeled updates.
The formal definition of this model uses variables (shared between threads of the same process) and
update objects (shared between different processes). Each update object is unique and supports the operation
BCAST(u, l) (broadcast update u with label l to all) and the operation $\mathrm{DELIVER}()_{u,l}$ (deliver the update u). For
unlabeled updates, the label has the null value, denoted ⊥. The delivery order of unlabeled updates is not
constrained beyond program order. A sequence of BCAST and DELIVER operations is valid if 1) no DELIVER
precedes its corresponding BCAST (this restriction does not require the corresponding BCAST to be in the
valid order), and 2) no specific DELIVER occurs more than once. Define the deliver relation (a partial order)
on the set $O_C$ of operations of a computation C of the partial-order broadcast model by:
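$x \xrightarrow{\mathrm{delOrder}_C} y \overset{\mathrm{def}}{=} x, y \in O_C \wedge x = \mathrm{DELIVER}()_{u_1,l_1} \wedge y = \mathrm{DELIVER}()_{u_2,l_2} \wedge \mathrm{BCAST}(u_1, l_1) \xrightarrow{\mathrm{prog}_C} \mathrm{BCAST}(u_2, l_2)$.
Let $O_C|\mathrm{delivers}(l)$ denote the set of all DELIVER operations returning an update with label l ≠ ⊥.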
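The validity condition for update objects and the deliver relation can be sketched as follows. The encoding is an assumption made for this example: a view is a list of tuples `("BCAST", u, label)` or `("DELIVER", u, label)`, updates are identified by unique names, and the program order of broadcasts is given as a set of pairs of update names.

```python
def is_valid_update_sequence(view):
    """1) no DELIVER precedes its BCAST; 2) no update is delivered twice.

    A DELIVER whose BCAST does not occur in the view at all is allowed.
    """
    delivered = set()
    for kind, u, _label in view:
        if kind == "DELIVER":
            if u in delivered:
                return False          # delivered more than once
            delivered.add(u)
        elif kind == "BCAST":
            if u in delivered:
                return False          # the DELIVER came before this BCAST
    return True


def respects_del_order(view, bcast_prog):
    """Delivers in the view follow the program order of their broadcasts.

    bcast_prog -- set of pairs (u1, u2) meaning BCAST(u1) precedes BCAST(u2)
                  in program order.
    """
    pos = {u: i for i, (kind, u, _label) in enumerate(view) if kind == "DELIVER"}
    return all(pos[u1] < pos[u2]
               for (u1, u2) in bcast_prog
               if u1 in pos and u2 in pos)


view = [("BCAST", "u1", None), ("DELIVER", "u1", None), ("DELIVER", "u2", None)]
print(is_valid_update_sequence(view), respects_del_order(view, {("u1", "u2")}))
```

Here `None` plays the role of the null label ⊥; the labels are not needed for validity and only matter for the agreement constraint of the model.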
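Definition 3.3. $\mathrm{POB}(L)[C] \overset{\mathrm{def}}{=} \exists \langle$ valid total order $(O_C|p, \xrightarrow{L_p}) : p \in P \rangle$ satisfying
$(\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{\mathrm{prog}_p} \cup \xrightarrow{\mathrm{delOrder}_C}])$ and $(\forall p, q \in P, l \in L : \mathrm{Agree}[O_C|\mathrm{delivers}(l), \xrightarrow{L_p}, \xrightarrow{L_q}])$
and $\forall p\, (\mathrm{BCAST}(m, l) \in O_C$ if and only if $\mathrm{DELIVER}()_{m,l} \in O_C|p)$.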
This definition captures what we described as the intermediate partial-order broadcast model. It requires
that each process’s view of its own operations is a valid sequence that extends the program order of its
own threads and delivers updates according to the program order of the corresponding broadcasts. It also
requires agreement between processes’ views of delivers of updates with the same label. The last conjunct
ensures that every process delivers exactly the updates that were broadcast.
4 Setup for Transformations and Proofs
4.1 Transformations setup
We implement any partition consistency model S on a network model N indirectly. We first implement S
on a partial-order broadcast model T and then implement T on N . For each of these two steps we provide
two different implementations, and prove each one is correct. All four of the resulting proofs have similar
structure and notation, which is described in this section.
All our implementations transform code for a specified model to code for a target model. For clarity,
SMALL CAPS font is used to denote specification level operations; Teletype is used to denote target level
operations. To emphasize that a component belongs to the target level its name is sometimes annotated with
a “hat” as in n̂ame.
4.2 Proofs setup and structure
Our transformations convert some specified multiprogram into a target multiprogram. To prove correctness,
we must show that the possible computations of these two multiprograms, arising from their respective
memory consistency models, have the same “outcome”. We make this precise as follows. Let τ(P ) denote
a transformation of multiprogram P . The set of possible computations of multiprogram P (respectively, τ(P ))
on the specified (respectively, target) memory consistency model MC (respectively, M̂C) is C(P,MC)
(respectively, C(τ(P ), M̂C)). The transformation τ turns each specified operation invocation that requires a
response into a subroutine that returns a response, so these returned responses can be used to interpret each
computation in C(τ(P ), M̂C) as a computation of P . We need to show that each such interpreted computation
could have arisen in the specified model. That is, we must show that the interpretation of any computation
in C(τ(P ), M̂C) is in C(P,MC). If this is satisfied for any P , we say that τ(·) correctly implements MC on
M̂C. Figure 1 depicts this proof obligation.
[Figure 1 (diagram): P generates C(P, MC); τ maps P to τ(P), which generates C(τ(P), M̂C); the edge to be shown is that C(P, MC) ⊇ the interpretation of C(τ(P), M̂C).]
Figure 1: Proof obligation for establishing correctness of transformations
Each of our memory consistency models in Section 3 is defined by requiring that for every computation
there is a collection of sequences of its operations such that
• each sequence is valid,
• each sequence extends some partial orders, and
• the set of sequences together satisfy some agreement constraints.
So we show that a computation C satisfies a memory consistency model MC by constructing such a
collection of sequences that jointly satisfy the MC’s constraints. Any such collection is said to witness that
C satisfies MC, and is informally referred to as a collection of witness sequences. More formally, we use the
predicate Witnesses[A,C,MC] to assert that the collection of sequences A witnesses that C satisfies MC.
Let P be a specified program and τ(P ) be a transformation of that program. All our proofs have the
following structure:
Assume: Ĉ ∈ C(τ(P ), M̂C). Let C ∈ C(P ) be the interpretation of Ĉ.
Build: Choose any collection of sequences Â such that Witnesses[Â, Ĉ, M̂C]. Use Â to construct a
corresponding collection of sequences A for the operations in C.
Verify: Show that Witnesses[A,C,MC].
In this paper, we consider only finite computations of the specified system that are completed in the
target system. For long-lived computations, we would need to consider computations that arise when a
multiprogram is part way through its execution and extensions of such computations as the multiprogram
continues to execute. Furthermore, the transformed multiprogram could be in a state where some processes
are part way through executing the transformation of their current operation invocation. Such an operation
invocation is incomplete. For example, a process could have sent some but not all messages required in
the transformation of its current operation invocation, and messages sent by a process could be received by
some recipients but not by others. The problem of incomplete operations is taken care of by generalizing
the technique used to show that a computation is linearizable as introduced by Herlihy and Wing [27].
That is, in the Assume step, we are allowed to adjust the computation Ĉ so that every operation in the
adjusted computation is complete before proceeding with the Build step that extracts the sequences Â, and
uses them to construct the witness sequences A. This is done as follows. For every incomplete operation
in Ĉ , either all its steps are erased or remaining steps are added so that it is complete. Such adding or
erasing of steps could also change the operations of other processes since they may be receiving and acting
on messages sent by operations that were incomplete. So the steps of these operations are also either
erased or completed. To take care of the problem that the multiprogram is long-lived, we must also show
that the witness sequences constructed for a computation, say C, of the system are not invalidated by the
witness sequences that are constructed for an extension of that computation, say C′, as the system continues
to execute. This is done by showing that the collection A of sequences that are constructed to establish
Witnesses[A,C,MC], are each prefixes of the corresponding collection A′ of sequences that are constructed
to establish Witnesses[A′, C ′,MC]. For the proofs in this paper, these two tasks are straightforward, but add
considerable notational overhead. We leave it to the reader to observe that the results hold for long-lived
computations but consider only finite computations with only completed transformations in this paper.
Proof diagrams
When designing and debugging our proofs, we frequently used diagrams to record partial orders and various
relationships between them. Because these diagrams could be formalized and used to help make our proofs
more precise and concise, we adopt this diagrammatic notation here. The notation used in these diagrams is
as follows. Let a, b, c and d be operations; and A and B be sets of operations. Edges in a diagram represent
boolean expressions and a diagram is interpreted as the conjunction of these expressions. The basic building
blocks are:
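diagram symbol (edge glyphs described in words)            asserts
a joined to b by an equality edge                          $a = b$
A joined to B by an inclusion edge                         $A \subseteq B$
solid arrow labeled L from a to b                          $a \xrightarrow{L} b$
solid arrow labeled L from A to B                          $\forall a \in A, b \in B : a \xrightarrow{L} b$
dashed arrow from A to B (first variant)                   $\exists b \in B : \forall a \in A : a \xrightarrow{L} b$
dashed arrow from A to B (second variant)                  $\exists a \in A : \forall b \in B : a \xrightarrow{L} b$
dashed edge labeled R between a and b                      $a \sim_R b$, where R is a relation
arrow $a \xrightarrow{L} b$ joined by an implication edge to arrow $c \xrightarrow{M} d$    $(a \xrightarrow{L} b) \implies (c \xrightarrow{M} d)$
Multiple edges are disjunctive; e.g., two edges labeled D and E from a to b assert ($a \xrightarrow{D} b$ or $a \xrightarrow{E} b$). The position of an arrow and
its label are not significant; e.g., an arrow labeled L drawn from a on the left to b on the right is equivalent to the same arrow drawn with b on the left and a on the right. Notice that for sets A and B, the
notation is similar to Lamport’s system executions [44].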
5 Implementing Partition Consistency on the Partial-Order Broadcast Model
Partition consistency models processes that interact by reading and writing globally shared variables that
have only weak consistency guarantees. Recall that an instance of partition consistency has a partition K
of some of the variables and requires strong agreement within but not between sets of K . Our task is to
transform each specified process p of such a system into a target process p̂ for the partial-order broadcast
model, where inter-process communication is via the partial order broadcast primitive. Our implementation
is a generalization of the way that totally ordered broadcast is used to implement sequential consistency [9].
The predicate PC(K) for partition consistency ensures agreement on the ordering of writes to all variables
within the same class of the partition K; the consistency predicate POB(L(K)) satisfied by the partial-order
broadcast model is used to implement this agreement by enforcing agreement on the deliveries of updates
with the same label.
We achieve our implementation by:
• creating a label for each class in partition K; that is, L(K) = {i : Vi ∈ K};
• mapping each p to a thread p̂.m in the partial-order broadcast model; and
• augmenting each p̂.m with a companion delivery thread p̂.d.
There are two variants of the transformation. The pseudo-code for the slow-write/fast-read variant (SWFR)
is shown in Figure 2.
The main thread, p̂.m, is derived from p by replacing each READ and each WRITE to a shared variable
with a subroutine call. The transformation of a READ simply returns the value stored in p̂’s local memory.
The transformation of a WRITE creates a bcast operation to be delivered to each target process. It has a
label corresponding to the partition class of the variable being written, if it exists. The delivery thread, p̂.d,
manages the deliver operations and maintains synchronization with p̂.m via locally shared variables. It
repeatedly applies updates to the local memory it shares with p̂.m. The procedure WaitWritesComplete
causes p̂.m to wait until p̂.d has applied all the WRITEs previously broadcast by p̂.m.
Under the $\mathrm{SWFR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$ transformation, each process has at most one outstanding local write, since
every write must be applied locally before the subroutine completes. Every write contains a wait, making
these writes “slow”. An alternative is to move the WaitWritesComplete call from the end of the WRITE to
the beginning of the READ. This gives us a fast-write/slow-read ($\mathrm{FWSR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$) implementation.
5.1 Correctness of the $\mathrm{SWFR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$ and $\mathrm{FWSR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$ implementations
The proofs of correctness of $\mathrm{SWFR}^{\mathrm{PC}}_{\mathrm{POB}}$ and $\mathrm{FWSR}^{\mathrm{PC}}_{\mathrm{POB}}$ are very similar; they differ in only one step. This step
can be treated generically, so we present one proof for both implementations. Let $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$ refer to either
implementation.

$\mathrm{SWFR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}$ Implementation;
Code for each process p ∈ P.
1. Transformation’s local target variables
    Memory[p̂].x        ∀x ∈ V, local replica variable, initial value is the initial value of x
    writes-processed    local variable, counts messages BCASTed by p that are also DELIVERed by p, initially 0
    writes-requested    local variable, counts locally BCASTed messages, initially 0
2. Transforming specification processes
Transformation of process p to thread p̂.m:
    SWFR(READ_p(x))
        return Memory[p̂].x
    SWFR(WRITE_p(x, v))
        writes-requested ← writes-requested + 1
        if ∃Vi ∈ K : x ∈ Vi
            then bcast([x, v, p̂], i)
            else bcast([x, v, p̂], ⊥)
        WaitWritesComplete()
    WaitWritesComplete()
        while writes-processed < writes-requested
            do skip
3. New target threads
∀p ∈ P, new thread p̂.d.
    p̂.d:
        while TRUE
            do ApplyWrite()
    ApplyWrite()
        update, l ← deliver()
        {update has form [x, v, source]}
        let [x, v, source] = update
        Memory[p̂].x ← v
        if source = p̂
            then writes-processed ← writes-processed + 1
Figure 2: Implementation of Partition Consistency on the Partial-Order Broadcast Model
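The slow-write/fast-read transformation can be exercised with a small single-threaded simulation. The sketch below is not the paper's implementation: it assumes a stand-in broadcast layer (`ToyBroadcast`, a hypothetical class) that delivers every update to every process in one global order, which trivially gives the agreement required for each label, and it folds the delivery thread into the waiting loop so that the run is deterministic. Initial values are assumed to be 0, `None` stands for ⊥, and class indices start at 0.

```python
from collections import deque


class ToyBroadcast:
    """Stand-in broadcast layer: one FIFO queue of updates per process."""

    def __init__(self, n):
        self.queues = [deque() for _ in range(n)]

    def bcast(self, update, label):
        for q in self.queues:
            q.append((update, label))

    def deliver(self, pid):
        return self.queues[pid].popleft()

    def has_pending(self, pid):
        return bool(self.queues[pid])


class SwfrProcess:
    """Slow-write/fast-read process with a local replica of every variable."""

    def __init__(self, pid, net, variables, partition):
        self.pid = pid
        self.net = net
        self.memory = {x: 0 for x in variables}
        self.partition = partition          # list of sets of variables (K)
        self.writes_requested = 0
        self.writes_processed = 0

    def read(self, x):
        return self.memory[x]               # fast: local replica only

    def write(self, x, v):
        self.writes_requested += 1
        label = next((i for i, Vi in enumerate(self.partition) if x in Vi), None)
        self.net.bcast((x, v, self.pid), label)
        self.wait_writes_complete()         # slow: wait for own update

    def wait_writes_complete(self):
        # In the paper a separate delivery thread applies updates; here the
        # waiting loop applies them itself so the simulation is sequential.
        while self.writes_processed < self.writes_requested:
            self.apply_write()

    def apply_write(self):
        (x, v, source), _label = self.net.deliver(self.pid)
        self.memory[x] = v
        if source == self.pid:
            self.writes_processed += 1

    def drain(self):
        while self.net.has_pending(self.pid):
            self.apply_write()
```

A fast-write/slow-read variant would, under the same assumptions, drop the call to `wait_writes_complete` from `write` and make it the first step of `read`.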
Theorem 5.1. Let P be any multiprogram where each process in P is a single-threaded program that
accesses read/write variables from a set V , and let K = {V1, . . . , Vk} be any partition of a subset of V .
Then $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(P)$ correctly implements PC(K) on POB(L(K)), for all such P .
Proof.
Assume: Let Ĉ be a computation in $C(\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(P), \mathrm{POB}(L(K)))$ and let C be the interpretation of
Ĉ. Let Ô denote the set of operations $O_{\hat{C}}$. To show PC(K)[C] we construct witness sequences that satisfy
the requirements of Definition 3.1.
Build: Choose a collection of witness sequences $\langle (\hat{O}|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}) : \hat{p} \in \mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(P) \rangle$. That is,
$\mathrm{Witnesses}[\{\hat{L}_{\hat{p}} : \forall \hat{p} \in \hat{P}\}, \hat{C}, \mathrm{POB}(L(K))]$.
Construct a corresponding set of sequences $\langle (O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}) : p \in P \rangle$ as follows:
1. For each read or write operation o on a “local replica” variable we associate it with a specification
level operation as follows:
(a) The transformation sets up a one-to-one correspondence between the set of read operations of
the POB(L(K)) system, and the READ operations of the PC(K) system. Specifically, $\mathrm{read}(\mathrm{Memory}[p].x)_v$
in the implementation must have come from the transformation of a unique corresponding
specification level $\mathrm{READ}(x)_v \in O_C|p$.
(b) The transformation sets up, for each p̂, a one-to-one correspondence between the set of write
operations in Ô|p̂ of the POB(L(K)) system, and the WRITE operations of the PC(K) system.
More precisely, every write(Memory[p].x, v) must have an $o_d = \mathrm{DELIVER}()_{[\mathrm{WRITE},x,v]}$ in the same
ApplyWrite() call. This deliver operation $o_d$ must have a corresponding BCAST(m), which can
only have occurred in the transformation of some unique corresponding specification level write
$\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(\mathrm{WRITE}(x, v))$.
2. For each sequence L̂p̂, build the sequence Short(L̂p̂) by removing all operations that are applied to
the broadcast object and to the local variables writes-processed and writes-requested.
This leaves only the read and write operations on the “local replica” variables in local memory.
3. For each L̂p̂, build a sequence Lp by replacing each read (respectively, write) operation in Short(L̂p̂)
with the associated specification level READ (respectively, WRITE) operation defined in step 1.
4. Notice that each sequence Lp contains exactly the operations in $O_C|p \cup O_C|\mathrm{wrts}$. These sequences
induce the corresponding total orders $\langle (O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}) : p \in P \rangle$.
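Steps 2 and 3 of the construction amount to a projection followed by a renaming. The sketch below illustrates this on a toy encoding that is assumed only for the example: target operations are tuples whose first field is the operation name, and replica variables are named with the prefix `Memory.`.

```python
REPLICA = "Memory."   # assumed naming convention for local replica variables


def short(target_view):
    """Step 2: keep only reads and writes on local replica variables."""
    return [op for op in target_view
            if op[0] in ("read", "write") and op[1].startswith(REPLICA)]


def to_spec(short_view):
    """Step 3: replace each target read/write by its specification operation."""
    return [(kind.upper(), var[len(REPLICA):], val)
            for (kind, var, val) in short_view]


# Target-level view of one process: a WRITE(x, 1) followed by a READ(x).
target_view = [
    ("write", "writes-requested", 1),
    ("bcast", ("x", 1, "p"), 0),
    ("deliver", ("x", 1, "p"), 0),
    ("write", "Memory.x", 1),
    ("write", "writes-processed", 1),
    ("read", "Memory.x", 1),
]
print(to_spec(short(target_view)))
```

Operations on the broadcast object and on the two counters disappear, and the relative order of the remaining operations is unchanged, which is the property the Verify step relies on.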
Verify: We need to prove that $\mathrm{Witnesses}[\{L_p : \forall p \in P\}, C, \mathrm{PC}(K)]$ holds for the orders
$\langle (O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}) : p \in P \rangle$ constructed in the Build step.
First observe that removing from any valid sequence all operations related to specific objects preserves
validity. Hence each Short(L̂p̂) is valid since L̂p̂ is valid. Then replacing each read and write with
the corresponding READ and WRITE respectively also clearly preserves validity. We conclude that the total
orders $(O|p \cup O|\mathrm{wrts}, \xrightarrow{L_p})$ are valid. The following two lemmas establish the remainder of the proof:
Constraint                                                                                                   Lemma
$\mathrm{Extends}[O_C|p \cup O_C|\mathrm{wrts}, \xrightarrow{L_p}, \xrightarrow{\mathrm{prog}_C}]$                                    Lemma 5.2
$\forall p, q \in P, i \in [1, k] : \mathrm{Agree}[O_C|\mathrm{wrts}(V_i), \xrightarrow{L_p}, \xrightarrow{L_q}]$                     Lemma 5.3
Therefore, PC(K)[C] as required.
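Lemma 5.2. $\forall p \in P : \mathrm{Extends}[O|p \cup O|\mathrm{wrts}, \xrightarrow{L_p}, \xrightarrow{\mathrm{prog}_C}]$.
Proof. Let $o_1 \xrightarrow{\mathrm{prog}} o_2$ where $o_1, o_2 \in O|p \cup O|\mathrm{wrts}$. PC(K) only has read/write variable objects, so there are
four cases for o1, o2. Notice that if o1 or o2 is a READ then o1, o2 ∈ O|p.
Case 1: read, read
Let $o_1 = \mathrm{READ}(x)_v$ and $o_2 = \mathrm{READ}(y)_w$.
$\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(\mathrm{READ}(x)) \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(\mathrm{READ}(y))$
$\implies \mathrm{read}(x)_v \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{read}(y)_w$
$\implies \mathrm{read}(x)_v \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{read}(y)_w$
$\implies o_1 \xrightarrow{L_p} o_2$.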
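Case 2: read, write
Let $o_1 = \mathrm{READ}(x)_v$ and $o_2 = \mathrm{WRITE}(y, w)$. The diagram for this case records the following chain:
• $\mathrm{READ}(x)_v \xrightarrow{\mathrm{prog}_p} \mathrm{WRITE}(y, w)$, so the transformation gives $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_1) \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_2)$.
• Within these subroutines, $\mathrm{read}(x)_v \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{bcast}(o_2) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{deliver}()_{o_2} \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{write}(\mathrm{Memory}[p].y, w)$, where the middle step holds by note 1.
• Hence $\mathrm{read}(x)_v \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{bcast}(o_2) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{deliver}()_{o_2} \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{write}(\mathrm{Memory}[p].y, w)$.
• By the correspondence between target and specification operations, $\mathrm{READ}(x)_v \xrightarrow{L_p} \mathrm{WRITE}(y, w)$.
1. By definition of validity of update objects (since bcast and deliver are both in O|p).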
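Case 3: write, write
Let $o_1 = \mathrm{WRITE}(x, v)$ and $o_2 = \mathrm{WRITE}(y, w)$. Suppose $o_1, o_2 \in O|q$. The diagram for this case records the following chain:
• $\mathrm{WRITE}(x, v) \xrightarrow{\mathrm{prog}_q} \mathrm{WRITE}(y, w)$, so the transformation gives $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_1) \xrightarrow{\widehat{\mathrm{prog}}_q} \mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_2)$.
• These subroutines contain the broadcasts $\mathrm{bcast}([x, v, \hat{q}], \cdot) \xrightarrow{\widehat{\mathrm{prog}}_q} \mathrm{bcast}([y, w, \hat{q}], \cdot)$.
• Hence $\mathrm{deliver}()_{[x,v,\hat{q}],\cdot} \xrightarrow{\widehat{\mathrm{delOrder}}} \mathrm{deliver}()_{[y,w,\hat{q}],\cdot}$.
• The ApplyWrite calls of p̂ that contain these delivers are therefore ordered by $\xrightarrow{\widehat{\mathrm{prog}}_p}$.
• These calls contain the writes $\mathrm{write}(\mathrm{Memory}[\hat{p}].x, v) \xrightarrow{\widehat{\mathrm{prog}}_p} \mathrm{write}(\mathrm{Memory}[\hat{p}].y, w)$.
• Hence $\mathrm{write}(\mathrm{Memory}[\hat{p}].x, v) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{write}(\mathrm{Memory}[\hat{p}].y, w)$.
• By the correspondence between target and specification operations, $\mathrm{WRITE}(x, v) \xrightarrow{L_p} \mathrm{WRITE}(y, w)$.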
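Case 4: write, read
Let $o_1 = \mathrm{WRITE}(x, v)$ and $o_2 = \mathrm{READ}(y)_w$. The diagram for this case records the following relationships:
• $\mathrm{WRITE}(x, v) \xrightarrow{\mathrm{prog}_p} \mathrm{READ}(y)_w$, so the transformation gives $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_1) \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(o_2)$.
• In target program order, $\mathrm{write}(\textit{writes-requested}, c_1) \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{bcast}([x, v, \hat{p}], \cdot) \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{read}(\textit{writes-processed})_{c_2} \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{read}(\textit{writes-requested})_{c_2 : c_2 \geq c_1} \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{read}(y)_w$.
• $\mathrm{bcast}([x, v, \hat{p}], \cdot) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{deliver}()_{[x,v,\hat{p}],\cdot}$ (edge 1).
• $\mathrm{deliver}()_{[x,v,\hat{p}],\cdot} \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{write}(x, v) \xrightarrow{\widehat{\mathrm{prog}}_{\hat{p}}} \mathrm{write}(\textit{writes-processed}, c_1)$ (edge 2).
• $\mathrm{write}(\textit{writes-processed}, c_1) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{read}(\textit{writes-processed})_{c_2}$ (edge 3).
• Hence $\mathrm{write}(x, v) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{read}(y)_w$, and by the correspondence between target and specification operations, $\mathrm{WRITE}(x, v) \xrightarrow{L_p} \mathrm{READ}(y)_w$.
1. By validity of L̂p̂.
2. Both fast-write and slow-write transformations call WaitWritesComplete between any write and
subsequent read by the same process.
3. c2 ≥ c1 and writes-processed is increased every time it is set. If a value greater than c1 is read from
writes-processed, it must have been written after write(writes-processed, c1).
Thus in all cases we have $o_1 \xrightarrow{L_p} o_2$. Therefore $\xrightarrow{L_p}$ extends $\xrightarrow{\mathrm{prog}}$.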
Case 4 exemplifies how the proofs for both $\mathrm{SWFR}^{\mathrm{PC}}_{\mathrm{POB}}$ and $\mathrm{FWSR}^{\mathrm{PC}}_{\mathrm{POB}}$ are unified. The only property
needed is that there is a WaitWritesComplete between the BCAST and the READ(). Both $\mathrm{SWFR}^{\mathrm{PC}}_{\mathrm{POB}}$ and
$\mathrm{FWSR}^{\mathrm{PC}}_{\mathrm{POB}}$ satisfy this property and so one proof suffices for the correctness of both implementations.
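Lemma 5.3. $\forall p, q \in P, i \in [1, k] : \mathrm{Agree}[O|\mathrm{wrts}(V_i), \xrightarrow{L_p}, \xrightarrow{L_q}]$.
Proof. Let $\mathrm{WRITE}_1, \mathrm{WRITE}_2 \in O|\mathrm{wrts}(V_i)$ for some i. Then the transformation $\mathrm{WR}^{\mathrm{PC}(K)}_{\mathrm{POB}(L(K))}(\mathrm{WRITE}_i)$ (i ∈
{1, 2}) of each of these contains a corresponding broadcast $\mathrm{bcast}(msg_i, l)$ where the label l is the same
for each. For every process p, L̂p̂ contains a deliver followed by a write corresponding to each of
these bcasts. Let $w_{1p}, w_{1q}, w_{2p}, w_{2q}$ be these corresponding writes in $O_{\hat{C}}$ for two processes p and q and
suppose without loss of generality that $w_{1p} \xrightarrow{\hat{L}_{\hat{p}}} w_{2p}$. Then
$\mathrm{deliver}()_{msg_{w1}} \xrightarrow{\hat{L}_{\hat{p}}} w_{1p} \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{deliver}()_{msg_{w2}} \xrightarrow{\hat{L}_{\hat{p}}} w_{2p}$
which implies
$\mathrm{deliver}()_{msg_{w1}} \xrightarrow{\hat{L}_{\hat{q}}} w_{1q} \xrightarrow{\hat{L}_{\hat{q}}} \mathrm{deliver}()_{msg_{w2}} \xrightarrow{\hat{L}_{\hat{q}}} w_{2q}$
because POB(L(K)) requires that deliveries of messages
with the same label must agree. Hence, $\mathrm{WRITE}_1 \xrightarrow{L_p} \mathrm{WRITE}_2$ and $\mathrm{WRITE}_1 \xrightarrow{L_q} \mathrm{WRITE}_2$ by the construction of
Lp and Lq from L̂p̂ and L̂q̂ respectively.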
6 Implementing The Partial-Order Broadcast Model on the Message-Passing
Network Model Using Tokens
The partial-order broadcast model is very similar to the message-passing network model. The READ and
WRITE operations are on local variables in both models, so they are mapped by the transformation with
the identity function. That is, READ(x) (respectively, WRITE(x, v)) is mapped to read(x) (respectively,
write(x, v)). We need only specify how to implement BCAST and DELIVER by sending and receiving
messages. The processes in P are numbered starting at 0, and organized into a virtual ring such that
next(p) = (p + 1) mod |P |. A token is created for each label l ∈ L, and for each token, a thread is
created on each process to manage it.
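The virtual ring is simple enough to sketch directly. The helper names below (`next_process`, `token_receivers`) are illustrative; the only facts taken from the text are that processes are numbered from 0 and that next(p) = (p + 1) mod |P|. The second helper additionally assumes, as in the token thread of the implementation, that process 0 starts the circulation by sending the token to its successor.

```python
def next_process(p, n):
    """Successor of process p on a virtual ring of n processes numbered from 0."""
    return (p + 1) % n


def token_receivers(n, hops):
    """Processes that receive the token, in order, over the given number of hops.

    Process 0 is assumed to send the token first, so the first receiver is
    next_process(0, n).
    """
    holder = 0
    receivers = []
    for _ in range(hops):
        holder = next_process(holder, n)
        receivers.append(holder)
    return receivers


print(token_receivers(4, 5))
```

One such ring traversal exists per label, since a separate token (and token thread on each process) is created for every label in L.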
The broadcast of a labeled update requires that all processes agree on the delivery order of all updates
with the same label. To ensure this, the transformation of a BCAST of a labeled update has the process acquire
the appropriate token for that label by synchronizing with its token thread, send a message containing the
update information to each process, and wait for acknowledgments from all the processes before completing.
Since unlabeled updates only require that program order is maintained, unlabeled messages are sent (with
send) to every other process without acquiring a token.
To avoid deadlock this transformation requires that BCASTs and DELIVERs are invoked by separate
threads.
Each call to PassToken manages one acquisition and subsequent release of a token. It returns only after
handshaking with ProtectedBcast to determine that the token is no longer needed and can be released.
(pattern) ⊣ recv() is pseudo-code that blocks until a message matching pattern is received and stored in
the appropriate pattern variables.
6.1 Correctness of $\mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}$ transformation
Theorem 6.1. Let P be a multiprogram that uses READs, WRITEs, BCASTs, and DELIVERs where each process
has two threads; one that calls DELIVER but not BCAST and one that calls BCAST but not DELIVER. Then
$\mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}(P)$ correctly implements POB(L) on NW, for any label set L and any such P .
Proof.
Assume: Let Ĉ be a computation in $C(\mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}(P), \mathrm{NW})$ and let C be the interpretation of Ĉ. Let
Ô denote the set of operations $O_{\hat{C}}$. To show POB(L)[C] we construct witness sequences that satisfy the
requirements of Definition 3.3.
Build: Choose some collection of witness sequences $\langle (\hat{O}|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}) : \hat{p} \in \mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}(P) \rangle$. That is,
$\mathrm{Witnesses}[\{\hat{L}_{\hat{p}} : \forall \hat{p} \in \mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}(P)\}, \hat{C}, \mathrm{NW}]$. Recall that L̂p̂ denotes the sequence induced by the total
order $\xrightarrow{\hat{L}_{\hat{p}}}$. We now construct $\langle (O_C|p, \xrightarrow{L_p}) : p \in P \rangle$ from $\langle (\hat{O}|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}) : \hat{p} \in \mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}(P) \rangle$ as follows.
For each p̂, first create the sequence Short(L̂p̂) from L̂p̂ by removing:
• all operations on the handshake variables needToken [l] and doorOpen [l] for all l ∈ L.
• all send operations except the first send of a bcastop() that is in L̂p̂.
• all recv operations except the recv of a DELIVER.
Observe that each operation remaining in Short(L̂p̂) is a target level operation that was produced from the
transformation of some specification level operation, rather than by a thread created by the transformation.
$\mathrm{TKN}^{\mathrm{POB}(L)}_{\mathrm{NW}}$ Implementation;
Code for each process p ∈ P.
1. Transformation’s local target variables
    needToken[l]    for each l ∈ L; handshake boolean variable, initially TRUE
    doorOpen[l]     for each l ∈ L; handshake boolean variable, initially TRUE
    x̂               ∀x ∈ V, replica of local variable, initial value is the initial value of x.
2. Transforming specification threads
Transformation of thread p.m to p̂.m and p.d to p̂.d:
    TKN(BCAST(u, l))
        if l ≠ ⊥
            then ProtectedBcast(u, l)
            else bcastop(u, l)
    TKN(DELIVER())
        (q, p, [MESSAGE, u, l]) ⊣ recv()
        if l ≠ ⊥
            then send(p, q, [ACK])
        return u, l
    TKN(READ(x))
        return x̂
    TKN(WRITE(x, v))
        x̂ ← v
    ProtectedBcast(u, l)
        needToken[l] ← TRUE
        while ¬doorOpen[l] skip
        bcastop(u, l)
        doorOpen[l] ← FALSE
        needToken[l] ← FALSE
    bcastop(u, l)
        forall q ∈ P
            do send(p, q, [MESSAGE, u, l])
        if l ≠ ⊥
            {Wait for acknowledgment}
            then forall q ∈ P
                do (q, p, [ACK]) ⊣ recv()
3. New target threads
One new token thread p̂.TokenThread_l, ∀l ∈ L:
    p̂.TokenThread_l:
        if p̂ = 0
            then send(p̂, next(p̂), [TOKEN, BCASTGROUPTOKEN_l])
        loop
            do PassToken_p̂(l)
    next(p̂)
        return (p̂ + 1) mod |P̂|
    PassToken_p(l)
        (q, p, [TOKEN, BCASTGROUPTOKEN_l]) ⊣ recv()
        if needToken[l]
            then doorOpen[l] ← TRUE
        while needToken[l] skip
        send(p, next(p), [TOKEN, BCASTGROUPTOKEN_l])
Figure 3: Token implementation of Partial-Order Broadcast on the Message-Passing Network Model
Now convert each Short(L̂p̂) to an new sequence Lp of specification level operations by replacing each
target level operations with the specification level operations that produced it. More precisely:
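Target operation | in transformation | is replaced by
$\mathrm{read}(x)_v$ | $\in \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{READ}(x)_v)$ | $\mathrm{READ}(x)_v$
$\mathrm{write}(x, v)$ | $\in \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{WRITE}(x, v))$ | $\mathrm{WRITE}(x, v)$
$\mathrm{send}(s, d, m)$ | $\in \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(m, l))$ | $\mathrm{BCAST}(m, l)$
$\mathrm{recv}()_{s,d,m}$ | $\in \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{DELIVER}()_{m,l})$ | $\mathrm{DELIVER}()_{m,l}$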
Verify: The read and write operations in L̂p̂ are valid. Thus the subset that remains in Short(L̂p̂) re-
mains valid because it consists of exactly the subset of these operations that are applied to local variables,
and projecting a valid sequence onto all the operations applied to a subset of objects preserves validity. Each
read (respectively, write) in Short(L̂p̂) is replaced with the corresponding READ (respectively, WRITE)
in the construction of Lp, so for each p, all READ and WRITE operations are valid. To see that the BCAST
and DELIVER operations are also valid, we must confirm that for each Lp, 1) no DELIVER precedes its corresponding
BCAST and 2) no specific DELIVER occurs more than once. Sequence L̂p̂ (and thus Short(L̂p̂)) is
valid, so no recv precedes its corresponding send and no message is received more than once, ensuring
properties 1 and 2 after mapping from Short(L̂p̂) to Lp.
The following table shows the properties we prove to verify the constraints of Definition 3.3:
Constraint | Lemma
$\forall p \in P : (O_C|p, \xrightarrow{L_p})$ is a valid total order | By construction of $L_p$, as proved above
$\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{prog_p}]$ | Lemma 6.2
$\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{delOrder_C}]$ | Lemma 6.3
$\forall p, q \in P, l \in L : \mathrm{Agree}[O_C|delivers(l), \xrightarrow{L_p}, \xrightarrow{L_q}]$ | Lemma 6.6
$\forall p \in P : (\mathrm{BCAST}(m, l) \in O_C$ if and only if $\mathrm{DELIVER}()_{m,l} \in O_C|p)$ | By construction of $L_p$ and the network model
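Lemma 6.2. $\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{prog_p}]$
Proof. Let $o_1, o_2 \in O_C|p$ such that $o_1 \xrightarrow{prog_p} o_2$. Let $\mathrm{TKN}^{POB(L)}_{NW}(o).o$ denote the operation in the transformation
of $o$ that is mapped to $o$ when $\mathrm{Short}(\hat{L}_{\hat{p}})$ is converted to $L_p$. The proof diagram relates the two levels as follows:
  $o_1 \xrightarrow{prog} o_2$ at the specification level;
  $\mathrm{TKN}^{POB(L)}_{NW}(o_1) \xrightarrow{\widehat{prog}} \mathrm{TKN}^{POB(L)}_{NW}(o_2)$ for the transformations of $o_1$ and $o_2$;
  $\mathrm{TKN}^{POB(L)}_{NW}(o_1).o \xrightarrow{\widehat{prog}} \mathrm{TKN}^{POB(L)}_{NW}(o_2).o$ and hence, by 1, $\mathrm{TKN}^{POB(L)}_{NW}(o_1).o \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{TKN}^{POB(L)}_{NW}(o_2).o$;
  $o_1 \xrightarrow{L_p} o_2$ for the corresponding specification level operations.
1. $\mathrm{Extends}[O_C|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}, \xrightarrow{\widehat{prog}_p}]$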
Lemma 6.3. $\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{delOrder_C}]$
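Proof. Let $o_1, o_2 \in O|p$ such that $o_1 \xrightarrow{delOrder} o_2$. Then $o_1 = \mathrm{DELIVER}()_{u_1,l_1}$, $o_2 = \mathrm{DELIVER}()_{u_2,l_2}$ and
$\mathrm{BCAST}(u_1, l_1) \xrightarrow{prog_s} \mathrm{BCAST}(u_2, l_2)$ for some process $s \in P$. The proof diagram consists of the following steps:
  $\mathrm{BCAST}(u_1, l_1) \xrightarrow{prog_s} \mathrm{BCAST}(u_2, l_2)$ at the specification level;
  $\mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(u_1, l_1)).\mathrm{send}(\hat{s}, \hat{p}, m_1) \xrightarrow{\widehat{prog}_{\hat{s}}} \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(u_2, l_2)).\mathrm{send}(\hat{s}, \hat{p}, m_2)$;
  for $i \in \{1, 2\}$: $\mathrm{send}(\hat{s}, \hat{p}, m_i) \xrightarrow{\widehat{MessageOrder}} \mathrm{TKN}^{POB(L)}_{NW}(\mathrm{DELIVER}()_{u_i,l_i}).\mathrm{recv}()_{\hat{s},\hat{p},m_i}$;
  $\mathrm{recv}()_{\hat{s},\hat{p},m_1} \xrightarrow{\widehat{FifoChannel}} \mathrm{recv}()_{\hat{s},\hat{p},m_2}$, and hence $\mathrm{recv}()_{\hat{s},\hat{p},m_1} \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{recv}()_{\hat{s},\hat{p},m_2}$;
  $\mathrm{DELIVER}()_{u_1,l_1} \xrightarrow{L_p} \mathrm{DELIVER}()_{u_2,l_2}$ for the corresponding specification level operations.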
The next two results are sublemmas that provide the pieces for our last requirement, Lemma 6.6. Informally,
the first shows that the messages sent by a BCAST are all received (and acknowledged) before the
BCAST completes. The second ensures that a BCAST of an update with label l by process p is implemented
while p̂ holds the token for label l. These two facts, together with the circulation of the token from process to
process, combine to establish that all processes receive the update messages with label l in the same order.
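The informal argument above can be exercised with a small single-threaded simulation. The sketch below is only an illustration under simplifying assumptions that are not part of Figure 3: the handshake variables are abstracted into a per-process `busy` flag, each ACK is treated as arriving immediately after the matching recv, unlabeled updates are omitted, and the names `simulate_token_pob`, `token_step` and `recv_step` are hypothetical. A seeded random scheduler interleaves token passes and message receipts.

```python
import random
from collections import deque

def simulate_token_pob(n, programs, labels, seed=0):
    """Simplified, single-threaded model of the token scheme (illustration only).

    programs[p] lists the (update, label) pairs process p broadcasts, in program
    order; every label must occur in `labels` (unlabeled updates are left out).
    Returns, for each process, the list of (update, label) pairs it received.
    """
    rng = random.Random(seed)
    pending = [deque(prog) for prog in programs]
    chan = {(s, d): deque() for s in range(n) for d in range(n)}  # fifo channels
    token_at = {l: 0 for l in labels}   # process currently visited by token l
    busy = [None] * n                   # label of p's bcastop in progress, if any
    awaiting = [0] * n                  # ACKs p's bcastop is still waiting for
    received = [[] for _ in range(n)]

    def token_step(l):
        p = token_at[l]
        if busy[p] is None and pending[p] and pending[p][0][1] == l:
            u, _ = pending[p].popleft()          # p needs token l: start bcastop
            busy[p], awaiting[p] = l, n
            for q in range(n):
                chan[(p, q)].append((u, l))
        else:
            token_at[l] = (p + 1) % n            # not needed here: pass it on

    def recv_step(s, d):
        u, l = chan[(s, d)].popleft()
        received[d].append((u, l))
        awaiting[s] -= 1                         # the ACK is modeled as immediate
        if awaiting[s] == 0:                     # bcastop done: release token l
            busy[s] = None
            token_at[l] = (s + 1) % n

    while any(pending) or any(b is not None for b in busy):
        # A token step is enabled unless its holder is mid-bcastop for that label.
        events = [("token", l) for l in labels if busy[token_at[l]] != l]
        events += [("recv", k) for k, q in chan.items() if q]
        kind, arg = rng.choice(events)
        if kind == "token":
            token_step(arg)
        else:
            recv_step(*arg)
    return received
```

Because a process releases the token for label l only after every process has received the current update, two updates with the same label are never in flight together, so every process sees them in the same order regardless of the schedule chosen by the seed.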
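Lemma 6.4. Each recv operation that corresponds to a specification level DELIVER operation is ordered
by $\widehat{HappensBefore}$ order between the invocation and the response of the BCAST operation that matches this
DELIVER.
Proof. Let $\mathrm{recv}()_{\hat{q},\hat{p},[MESSAGE,u,l]}$ correspond to some $\mathrm{DELIVER}()_{u,l}$, and let $\mathrm{BCAST}(u, l)$ be the matching
BCAST. Consider the $\mathrm{send}(\hat{p}, \hat{q}, [ACK])$ that is sent after this receive. The proof diagram gives the chain
  $\mathrm{send}(\hat{q}, \hat{p}, [MESSAGE, u, l]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{q},\hat{p},[MESSAGE,u,l]} \xrightarrow{\widehat{prog}} \mathrm{send}(\hat{p}, \hat{q}, [ACK]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{p},\hat{q},[ACK]}$,
where $\mathrm{send}(\hat{q}, \hat{p}, [MESSAGE, u, l]) \xrightarrow{\widehat{prog}} \mathrm{recv}()_{\hat{p},\hat{q},[ACK]}$ and both of these operations belong to $\mathrm{BCAST}(u, l).\mathrm{bcastop}$.
The Lemma follows because $\widehat{HappensBefore}$ order extends $\widehat{prog}$ and $\widehat{MessageOrder}$.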
Lemma 6.5. For each $l \in L$, let $B_l = \{\mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(u, l)).\mathrm{bcastop} : \mathrm{BCAST}(u, l) \in O_C\}$. Then
$(B_l, \xrightarrow{\widehat{HappensBefore}})$ is a total order.
Proof. Informally, for any label l, the main thread p̂.main and the token thread for label l, p̂.tokenl, hand-
shake via the needToken [l] and doorOpen [l] variables. This ensures that a BCAST of an update with label l
by process p is implemented while p̂ holds that token.
More precisely:
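The proof diagram has one column for each of the two threads, p.main and p.token_l. Within each thread the operations are ordered by $\widehat{prog}$:
  p.main: $\mathrm{write}(needToken[l], TRUE) \to \mathrm{read}(doorOpen[l])_{TRUE} \to \mathrm{bcastop} \to \mathrm{write}(doorOpen[l], FALSE) \to \mathrm{write}(needToken[l], FALSE)$
  p.token_l: $\mathrm{recv}()_{[TOKEN,BCASTGROUPTOKEN_l]} \to \mathrm{read}(needToken[l])_{TRUE} \to \mathrm{write}(doorOpen[l], TRUE) \to \mathrm{read}(needToken[l])_{FALSE} \to \mathrm{send}([TOKEN, BCASTGROUPTOKEN_l])$
The two threads are linked by three $\widehat{WritesInto}$ edges:
  $\mathrm{write}(needToken[l], TRUE) \xrightarrow{\widehat{WritesInto}} \mathrm{read}(needToken[l])_{TRUE}$
  $\mathrm{write}(doorOpen[l], TRUE) \xrightarrow{\widehat{WritesInto}} \mathrm{read}(doorOpen[l])_{TRUE}$
  $\mathrm{write}(needToken[l], FALSE) \xrightarrow{\widehat{WritesInto}} \mathrm{read}(needToken[l])_{FALSE}$
$\widehat{HappensBefore}$ order extends $\widehat{WritesInto}$ and $\widehat{prog}$ and $\widehat{MessageOrder}$. So the above proof diagram
implies that given a bcastop, bcst,
  $\ldots\ \mathrm{recv}()_{[TOKEN,BCASTGROUPTOKEN_l]} \xrightarrow{\widehat{HappensBefore}} \mathrm{invocation}(bcst) \xrightarrow{\widehat{HappensBefore}} \mathrm{response}(bcst) \xrightarrow{\widehat{HappensBefore}} \mathrm{send}([TOKEN, BCASTGROUPTOKEN_l])\ \ldots$
Any other bcastop with the same label l by any process must similarly be preceded by
$\mathrm{recv}()_{[TOKEN,BCASTGROUPTOKEN_l]}$ and followed by $\mathrm{send}([TOKEN, BCASTGROUPTOKEN_l])$. Since there is exactly
one token for label l, it follows by message validity that any two method calls to bcastop for label l are
related by $\widehat{HappensBefore}$ order.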
Lemma 6.6. $\forall p, q \in P, l \in L : \mathrm{Agree}[O_C|delivers(l), \xrightarrow{L_p}, \xrightarrow{L_q}]$.
Proof. Let $\mathrm{DELIVER}()_{u_1,l}, \mathrm{DELIVER}()_{u_2,l} \in O_C$ and let $\mathrm{BCAST}(u_1, l)$, $\mathrm{BCAST}(u_2, l)$ be, respectively, the BCAST
operations that match these DELIVER operations. Consider the method calls bcst-1 = $\mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(u_1, l)).\mathrm{bcastop}$
and bcst-2 = $\mathrm{TKN}^{POB(L)}_{NW}(\mathrm{BCAST}(u_2, l)).\mathrm{bcastop}$ of the implementation of these BCASTs.
By Lemma 6.5, assume, without loss of generality, that bcst-1 $\xrightarrow{\widehat{HappensBefore}}$ bcst-2.
Then for any process q:
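  $\mathrm{recv}()_{p,q,[MESSAGE,u_1,l]} \xrightarrow{\widehat{HappensBefore}}$ bcst-1 $\xrightarrow{\widehat{HappensBefore}}$ bcst-2 $\xrightarrow{\widehat{HappensBefore}} \mathrm{recv}()_{r,q,[MESSAGE,u_2,l]}$,
where the first and last steps hold by 1. Hence $\mathrm{recv}()_{p,q,[MESSAGE,u_1,l]} \xrightarrow{\hat{L}_{\hat{q}}} \mathrm{recv}()_{r,q,[MESSAGE,u_2,l]}$, and for the corresponding
specification level operations, $\mathrm{DELIVER}()_{u_1,l} \xrightarrow{L_q} \mathrm{DELIVER}()_{u_2,l}$.
1. By Lemma 6.4.
Since this holds for every process q, all processes agree on the order of $\mathrm{DELIVER}()_{u_1,l}$ and $\mathrm{DELIVER}()_{u_2,l}$.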
7 Implementing The Partial-Order Broadcast Model on the Message-Passing
Network Model Using Timestamps
The TSPOB(L)NW implementation (Figures 4 and 5) uses timestamps to enforce agreement on the order of deliv-
eries of updates with the same label. It generalizes Attiya and Welch’s implementation of totally ordered
broadcast [9].
READ and WRITE operations are mapped by the identity transformation to the network model. The oper-
ations BCAST and DELIVER are implemented by send(source, destination,message) and recv() operations.
No new threads need to be added in this implementation.
By the definition of POB(L), the implementation must ensure that all processes agree on the deliver
order of messages with the same label. This is achieved using timestamps. Each process has |L| priority
queues, one for each message label. For unlabeled messages, it has one fifo-queue for each process. For
priority-queues, we denote the enqueue and dequeue operations by priority-enQ and extractmin re-
spectively. For fifo-queues, we denote the enqueue and dequeue operations by fifo-enQ and fifo-deQ
respectively.
Labeled messages are handled as follows. Messages with the same label are priority-enQed into the
same priority-queue. extractmin removes the message with the minimum (timestamp, process id) pair.
To ensure that all processes DELIVER messages with the same label in the same order, the implementation
guarantees that no message is dequeued by extractmin before all messages with a smaller or equal times-
tamp have been received and priority-enQed. Processes keep their timestamps up to date by adopting
the largest timestamp of any received message and sending their updated timestamp to all other processes.
Unlabeled messages are handled slightly differently. They are fifo-enQed into the fifo-queue for the
sending process, but timestamps are not used for fifo-deQing because agreement of delivery order is not
required for unlabeled messages.
The definition of POB(L) also requires that each process delivers messages in an order that extends
the program order of the corresponding BCASTs. This is not automatically enforced because messages are
spread across multiple priority-queues and fifo-queues. So the implementation uses counters. A message
is delivered by a process only if its counter value is exactly one greater than the counter value of the last message
delivered from the same source.
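The counter rule can be illustrated in isolation. The following sketch is a simplification made for illustration, not the code of Figures 4 and 5: it ignores timestamps entirely, represents each queue as a plain Python list of hypothetical `(source, counter, update)` triples, and uses the invented name `drain`. It shows that, even when the messages of one source are spread over several queues, the counter test releases them in the order in which they were broadcast.

```python
def drain(queues):
    """Deliver queue heads whose counter is exactly one greater than the last
    counter delivered from the same source; stop when no head qualifies.

    queues: list of lists of (source, counter, update), each list in queue order.
    Returns the updates in delivery order. Timestamps are deliberately ignored.
    """
    last = {}          # source -> counter of the last delivered message
    order = []
    progress = True
    while progress:
        progress = False
        for q in queues:
            if q and q[0][1] == last.get(q[0][0], 0) + 1:
                src, c, update = q.pop(0)
                last[src] = c
                order.append(update)
                progress = True
    return order

# Source "s" broadcast u1 (labeled), u2 (unlabeled), u3 (labeled), in that order.
fifo_queue = [("s", 2, "u2")]
priority_queue = [("s", 1, "u1"), ("s", 3, "u3")]
delivery_order = drain([fifo_queue, priority_queue])   # ["u1", "u2", "u3"]
```

Here the head of the fifo-queue (counter 2) is held back until the labeled message with counter 1 has been delivered.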
Observe that the DELIVER transformation does the heavy lifting in this implementation. In order to avoid
race conditions and more complicated synchronization, this implementation requires that for each process,
at most one thread performs DELIVER. Priority-queues and fifo-queues can be constructed from just variables
because each queue is accessed by only one thread.
In Figures 4 and 5, messages are designated as one of three types: LOCAL-BROADCAST-REQUEST ,
TS-UPDATE, ORD-MSG. Depending on type, a message can contain a timestamp, counter, sender id (denoted
m.src) as well as the label and value for the requested update.
TSPOB(L)NW : Code for each p ∈ P
1. Transformation’s local target variables
local-counter last broadcast counter value
counter [1 ..|P |] array of last received counter values, one for each process, initially all 0
T [1 ..|P |] array of last received timestamp values, one for each process, initially all 0
priorityQ [1 ..|L|] array of priority-queues for messages, one for each label l ∈ L, initially all empty
fifoQ [1 ..|P |] array of fifo-queues for unlabelled messages, one for each process, initially all empty
x̂ for each x ∈ V , target-level name for x
2. Transforming specification threads
Transformation of thread p.m to p̂.m and p.d to p̂.d :
TSPOB(L)NW (READ(x))
1 return x̂
TSPOB(L)NW (WRITE(x, v))
2 x̂← v
TSPOB(L)NW (BCAST(update, l))
3 send(p̂, p̂, [LOCAL-BROADCAST-REQUEST, update, l])
TSPOB(L)NW (DELIVER())
{ l can be ⊥ }
4 while (¬∃l ∈ L : CanExtract(priorityQ [l ]))
5 ∧(¬∃p̂ ∈ P̂ : CanDequeue(fifoQ [p̂]))
6 do HandleMessage()
7 case (choose (Ll : CanExtract(priorityQ [l ]))
8 |(Up̂ : CanDequeue(fifoQ [p̂])))
9 of
10 (Ll) then qe← extractmin(priorityQ [l ])
11 (Up̂) then qe← fifo-deQ(fifoQ [p̂])
12 counter[qe.src]← qe.counter
13 return qe. update, l
HandleMessagep()
13 ŝ, p̂,message ← recv()
14 case message of:
15 [LOCAL-BROADCAST-REQUEST, update, l]
16 then T [p̂]← T [p̂] + 1
17 local -counter ← local -counter +1
18 queue-element ← [update, T [p̂], local -counter , p̂]
19 ProcessQueueElement(queue-element , l, p̂)
20 FifoBroadcast([ORD-MSG, l, queue-element ])
21 [TS-UPDATE, timestamp, q̂]
22 then T [q̂]← timestamp
23 [ORD-MSG, l, queue-element]
24 then T [ŝ]← queue-element . timestamp
25 ProcessQueueElement(queue-element , l, ŝ)
26 if queue-element . timestamp > T [p̂]
27 then T [p̂]← queue-element . timestamp
28 FifoBroadcast([TS-UPDATE, T [p̂], p̂])
3. New target threads
No new target threads for this implementation.
Figure 4: Timestamp Implementation of Partial-Order Broadcast on the Message-Passing Network Model
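To make the interplay of timestamps, counters and queues concrete, the following is a minimal single-threaded simulation written for illustration. It follows the structure of the pseudocode but makes several assumptions of its own: all BCASTs are issued up front, a seeded random scheduler picks which channel is served next, each process delivers eagerly whenever something is deliverable, a queue-element is the tuple `(ts, src, counter, update)`, `None` stands for ⊥, and the function names (`simulate_tspob`, `deliver_ready`, and so on) are hypothetical.

```python
import heapq
import random
from collections import deque

def simulate_tspob(n, bcasts, labels, seed=0):
    """bcasts: list of (pid, update, label) in issue order; label None stands for ⊥.
    Returns, for each process, its list of delivered (update, label) pairs."""
    rng = random.Random(seed)
    T = [[0] * n for _ in range(n)]          # T[p][q]: last timestamp p saw from q
    counter = [[0] * n for _ in range(n)]    # counter[p][q]: last delivered from q
    local_counter = [0] * n
    priorityQ = [{l: [] for l in labels} for _ in range(n)]  # heaps per label
    fifoQ = [[deque() for _ in range(n)] for _ in range(n)]  # per-source queues
    delivered = [[] for _ in range(n)]
    chan = {(s, d): deque() for s in range(n) for d in range(n)}  # fifo channels

    def fifo_broadcast(p, msg):
        for q in range(n):
            if q != p:
                chan[(p, q)].append(msg)

    def process_queue_element(p, qe, label, src):
        if label is not None:
            heapq.heappush(priorityQ[p][label], qe)   # ordered by (ts, src)
        else:
            fifoQ[p][src].append(qe)

    def handle_message(p, src, msg):
        if msg[0] == "LOCAL-BROADCAST-REQUEST":
            _, update, label = msg
            T[p][p] += 1
            local_counter[p] += 1
            qe = (T[p][p], p, local_counter[p], update)
            process_queue_element(p, qe, label, p)
            fifo_broadcast(p, ("ORD-MSG", label, qe))
        elif msg[0] == "TS-UPDATE":
            T[p][src] = msg[1]
        else:                                         # ORD-MSG
            _, label, qe = msg
            T[p][src] = qe[0]
            process_queue_element(p, qe, label, src)
            if qe[0] > T[p][p]:                       # adopt the larger timestamp
                T[p][p] = qe[0]
                fifo_broadcast(p, ("TS-UPDATE", T[p][p]))

    def deliver_ready(p):
        progress = True
        while progress:
            progress = False
            # Timestamp part of CanExtract; fifo heads need no timestamp test.
            heads = [(l, h[0]) for l, h in priorityQ[p].items()
                     if h and all(h[0][0] <= t for t in T[p])]
            heads += [(None, q[0]) for q in fifoQ[p] if q]
            for label, (ts, src, c, update) in heads:
                if c == counter[p][src] + 1:          # counter part
                    if label is None:
                        fifoQ[p][src].popleft()
                    else:
                        heapq.heappop(priorityQ[p][label])
                    counter[p][src] = c
                    delivered[p].append((update, label))
                    progress = True

    for p, update, label in bcasts:
        chan[(p, p)].append(("LOCAL-BROADCAST-REQUEST", update, label))
    while True:
        ready = [k for k, q in chan.items() if q]
        if not ready:
            return delivered
        s, d = rng.choice(ready)
        handle_message(d, s, chan[(s, d)].popleft())
        deliver_ready(d)
```

Running the simulation with different seeds changes the interleaving of labeled and unlabeled deliveries at each process, but the relative order of updates that share a label, and of updates that share a source, should be identical everywhere.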
7.1 Correctness of the TSPOBNW implementation
Theorem 7.1. Let P be any multiprogram that uses READs, WRITEs, BCASTs, and DELIVERs such that at
most one thread in each process calls DELIVER. Then TSPOB(L)NW (P ) correctly implements POB(L) on NW, for
any label set L and any such P .
Proof.
Assume: Let $\hat{C}$ be a computation in $C(\mathrm{TS}^{POB(L)}_{NW}(P), NW)$ and let $C$ be the interpretation of $\hat{C}$. Let $\hat{O}$
denote the set of operations $O_{\hat{C}}$. To show POB(L)[C], we construct witness sequences that satisfy the
CanExtract(priorityQ)
28 if priority-isempty(priorityQ)
29 then return FALSE
30 else qe← peek-min(priorityQ )
31 return (qe.counter = counter[qe.src] + 1)
∧(∀q̂ ∈ P̂ : qe.ts ≤ T [q̂])
CanDequeue(fifoQ)
31 if isempty(fifoQ)
32 then return FALSE
33 else qe← peek-head(fifoQ)
34 return (qe.counter = counter[qe.src] + 1)
ProcessQueueElement(queue-element , l, ŝource)
25 if l ≠ ⊥
26 then priority-enQ(priorityQ [l ], queue-element)
27 else fifo-enQ(fifoQ [ŝource ], queue-element)
FifoBroadcast(message)
12 for q̂ ∈ P̂ \ {p̂}
13 do send(p̂, q̂,message)
Figure 5: Auxiliary Functions for the Timestamp Implementation
requirements of Definition 3.3.
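Build: Choose a collection of witness sequences $\langle (\hat{O}|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}) : \hat{p} \in \mathrm{TS}^{POB(L)}_{NW}(P) \rangle$. That is,
$\mathrm{Witnesses}[\{\hat{L}_{\hat{p}} : \forall \hat{p} \in \hat{P}\}, \hat{C}, NW]$. Recall that we use $\hat{L}_{\hat{p}}$ to denote the sequence induced by $(\hat{O}|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}})$.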
Construct the sequence Short(L̂p̂) from L̂p̂ by removing:
1. all the operations on the T , counter , and local-counter variables.
2. all send and recv operations except the send of a [LOCAL-BROADCAST-REQUEST ].
3. all operations on priority-queues and fifo-queues except extractmin and fifo-deQ operations.
The operations remaining in Short(L̂p̂) are reads, writes, sends, extractmins and fifo-deQs. Each
such operation, op, was produced from a transformation of some unique specification level operation, de-
noted lift(op), of the main thread. Specifically:
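$\mathrm{lift}(\mathrm{read}(x)_v) = \mathrm{READ}(x)_v$
$\mathrm{lift}(\mathrm{write}(x, v)) = \mathrm{WRITE}(x, v)$
$\mathrm{lift}(\mathrm{send}(s, d, [\text{LOCAL-BROADCAST-REQUEST}, update, l])) = \mathrm{BCAST}(update, l)$
$\mathrm{lift}(\mathrm{extractmin}(priorityQ[l])_m) = \mathrm{DELIVER}()_{m,l}$
$\mathrm{lift}(\text{fifo-deQ}(fifoQ[p])_m) = \mathrm{DELIVER}()_{m,\bot}$
(where ⊥ denotes an unlabelled update).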
For a sequence S, Lift(S) denotes the sequence formed by applying lift() to each operation in S. Define the
sequence Lp to be Lift(Short(L̂p̂)).
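The construction of Lp from L̂p̂ is a filter followed by a map, which the following sketch makes explicit. The encoding of operations as tuples, the variable-name test used to recognize the T, counter and local-counter variables, and the example sequence are all assumptions made for illustration; only the three removal rules and the five lift cases come from the text.

```python
BOOKKEEPING = ("T[", "counter[", "local-counter")   # assumed naming of target variables

def short(seq):
    """Apply the three removal rules to a target-level sequence of tuples."""
    kept = []
    for op in seq:
        kind = op[0]
        if kind in ("read", "write"):
            if not op[1].startswith(BOOKKEEPING):          # rule 1
                kept.append(op)
        elif kind == "send":
            if op[3][0] == "LOCAL-BROADCAST-REQUEST":      # rule 2
                kept.append(op)
        elif kind in ("extractmin", "fifo-deQ"):           # rule 3
            kept.append(op)
        # recv, priority-enQ, fifo-enQ and all other operations are dropped
    return kept

def lift(op):
    """Map a remaining target-level operation to its specification-level operation."""
    kind = op[0]
    if kind == "read":
        _, x, v = op
        return ("READ", x, v)
    if kind == "write":
        _, x, v = op
        return ("WRITE", x, v)
    if kind == "send":
        _, s, d, (_, update, label) = op
        return ("BCAST", update, label)
    if kind == "extractmin":
        _, label, m = op
        return ("DELIVER", m, label)
    if kind == "fifo-deQ":
        _, src, m = op
        return ("DELIVER", m, None)                        # None stands for ⊥
    raise ValueError(op)

L_hat = [
    ("write", "x", 5),
    ("send", 0, 0, ("LOCAL-BROADCAST-REQUEST", "u", "a")),
    ("recv", 0, 0, ("LOCAL-BROADCAST-REQUEST", "u", "a")),
    ("write", "T[0]", 1),
    ("write", "local-counter", 1),
    ("priority-enQ", "a", "u"),
    ("send", 0, 1, ("ORD-MSG", "a", "u")),
    ("read", "T[1]", 1),
    ("extractmin", "a", "u"),
    ("write", "counter[0]", 1),
    ("read", "x", 5),
]
L_p = [lift(op) for op in short(L_hat)]
```

For this example, `L_p` contains four operations: the WRITE of x, the BCAST of u, its DELIVER, and the READ of x, in that order.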
Verify: We now complete the proof by showing that the sequences 〈 Lp : p ∈ P 〉 just constructed are
witness sequences for POB(L)[C]. We do this by verifying the constraints of Definition 3.3.
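Constraint of Definition 3.3 | Lemma
$\forall p \in P : (O_C|p, \xrightarrow{L_p})$ is a valid total order | Lemma 7.7
$\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{prog}]$ | Lemma 7.6
$\forall p \in P : \mathrm{Extends}[O_C|p, \xrightarrow{L_p}, \xrightarrow{delOrder_C}]$ | Lemma 7.8
$\forall p, q \in P, l \in L : \mathrm{Agree}[O_C|delivers(l), \xrightarrow{L_p}, \xrightarrow{L_q}]$ | Lemma 7.9
$\forall p \in P : (\mathrm{BCAST}(m, l) \in O_C$ if and only if $\mathrm{DELIVER}()_{m,l} \in O_C|p)$ | Lemma 7.5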
Several parts of the following proofs are the same for labeled and unlabeled updates. When this
is the case, we use “queue” to mean any priority-queue or fifo-queue. We use enQ to denote either a
priority-enQ applied to priorityQ [l ] for some label l, or a fifo-enQ applied to fifoQ [p̂] for some pro-
cess p̂. Similarly, deQ denotes either an extractmin or a fifo-deQ. A subscript on a local variable in-
dicates which process owns that variable. For example, Tp̂[q̂] denotes p̂’s variable T [q̂]. Similarly, a subscript
on an operation indicates which process applied the operation. For example, priority-enQp̂(priorityQ ,m)
denotes that this priority-enQ was applied by p̂.
We begin with three sublemmas that capture the essential properties of timestamps. We rely on these
lemmas later.
Lemma 7.2. For all processes p̂ and r̂, the writes to Tp̂[r̂] taken in program order have strictly increasing
values.
Proof. Tp̂[p̂] changes value only in Line 15 where it is incremented, and Line 26 where it is boosted to a
bigger value. Therefore Tp̂[p̂] never decreases.
For r̂ ≠ p̂, p̂ writes a new value t to Tp̂[r̂] only in Lines 21 and 23 because p̂ received a TS-UPDATE
or ORD-MSG message from r̂ with timestamp t. So consider the timestamps in TS-UPDATE and ORD-MSG
messages sent by r̂ to p̂. We have just seen that Tr̂[r̂] does not decrease, so, given the increment in Line
15, any ORD-MSG sent by r̂ to p̂ (Line 19) contains a strictly bigger timestamp than that of any previous
message sent by r̂ to p̂. Similarly, given the increase in Line 26, any TS-UPDATE message sent by r̂ (Line
27) contains a strictly bigger timestamp than that of any previous message sent by r̂ to p̂. Since messages
are received in fifo order, these messages that p̂ receives from r̂ have increasing timestamps, confirming that
each of p̂’s writes to Tp̂[r̂] writes a bigger value than was previously written.
Lemma 7.3. If m1 and m2 are ORD-MSG messages with labels g and h (h ≠ ⊥) and queue-elements qe1
and qe2 respectively such that qe1.ts ≤ qe2.ts then for all p̂ ∈ P̂,
$\mathrm{enQ}_{\hat{p}}(queue_g, qe_1) \xrightarrow{\widehat{prog}} \mathrm{extractmin}_{\hat{p}}(priorityQ[h])_{qe_2}$.
Proof. For a process p̂ in P̂ to execute $\mathrm{extractmin}_{\hat{p}}(priorityQ[h])_{qe_2}$, CanExtractp̂(priorityQ [h]) must have
returned TRUE, implying Tp̂[q̂] ≥ qe2.ts for every process q̂, and hence, for qe1.src. By Lemma 7.2, each
write to Tp̂[qe1.src] is an increasing value, so Tp̂[qe1.src] ≥ qe2.ts remains true.
Notice that each enQp̂(queueg, qe1) is called from either Line 18 or Line 24, and each is preceded by a
write of qe1.ts to Tp̂[qe1.src] (Lines 15 and 23). Thus:
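  $\mathrm{write}_{\hat{p}}(T_{\hat{p}}[qe_1.src], qe_1.ts) \xrightarrow{\widehat{prog}} \mathrm{enQ}_{\hat{p}}(queue_g, qe_1) \xrightarrow{\widehat{prog}} \mathrm{extractmin}_{\hat{p}}(priorityQ[h])_{qe_2}$
  $\mathrm{write}_{\hat{p}}(T_{\hat{p}}[qe_1.src], qe_1.ts) \xrightarrow{\widehat{prog}} \mathrm{read}_{\hat{p}}(T_{\hat{p}}[qe_1.src])_{val}$ (by 1) and $\mathrm{read}_{\hat{p}}(T_{\hat{p}}[qe_1.src])_{val} \xrightarrow{\widehat{prog}} \mathrm{extractmin}_{\hat{p}}(priorityQ[h])_{qe_2}$ (by 2)
1. val ≥ qe1.ts   2. val ≥ qe2.ts by code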
Lemma 7.4. If any process enQs a queue-element with timestamp ts, then for all processes q̂ and r̂ eventu-
ally the value for Tq̂[r̂] becomes and remains at least ts.
Proof. By Lemma 7.2, Tq̂[r̂] never decreases, so it suffices to show that it eventually takes on a value equal
to at least ts.
A queue-element, say qe = [u, ts, c, p̂], can be enQed by process p̂ in line 18 or by process q̂ ≠ p̂ in line
24 of HandleMessage. Even in the second case, however, qe was previously enQed by process p̂ in line 18.
Process p̂ enQs qe as a consequence of its own LOCAL-BROADCAST-REQUEST, having incremented Tp̂[p̂] to equal
ts in Line 15. It then broadcasts qe to every other process. For each other process r̂, when r̂ receives qe, if
Tr̂[r̂] is smaller than ts, then it is boosted in Line 26 to equal ts. Thus, for every r̂, Tr̂[r̂] is eventually at
least as big as ts. Furthermore, r̂ FifoBroadcasts every change of Tr̂[r̂] via either an ORD-MSG at Line 19
or a TS-UPDATE message at Line 27, which, upon receipt by each other process q̂, causes q̂ to set Tq̂[r̂] to
the received value (Lines 21 and 23). It follows that for all processes q̂, r̂ ∈ P̂ , eventually q̂’s value for T [r̂]
is at least ts.
Lemma 7.5. $\mathrm{BCAST}(update, l) \in O_C$ if and only if $\mathrm{DELIVER}()_{update,l} \in O_C|p$, $\forall p$.
Proof. Each transformation of a BCAST(update , l) by process p generates a unique LOCAL-BROADCAST-
REQUEST by p̂, the transformation of p. Each LOCAL-BROADCAST-REQUEST results in the preparation
of a queue-element , say η, (Line 17) that contains update , and which is a parameter in the call by p̂ to
ProcessQueueElement (Line 18). This call results in an enQ to priorityQ [l ] if l ≠ ⊥ (Line 10) or to
fifoQ [p̂] if l = ⊥ (Line 11).
Process p̂ next sends a copy of η to every other process (Line 19) in HandleMessage. Therefore, each
BCAST(update , l) by process p results in an enQ of η at p̂, and also results in an enQ of η at every other
remote process:
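  $\mathrm{send}(\hat{p}, \hat{p}, [\text{LOCAL-BROADCAST-REQUEST}]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{p},\hat{p},[\text{LOCAL-BROADCAST-REQUEST}]} \xrightarrow{\widehat{prog}} \mathrm{enQ}_{\hat{p}}(\ldots, \eta)$
  $\mathrm{enQ}_{\hat{p}}(\ldots, \eta) \xrightarrow{\widehat{prog}} \mathrm{send}(\hat{p}, \hat{q}, [\text{ORD-MSG}, l, \eta]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{p},\hat{q},[\text{ORD-MSG},l,\eta]} \xrightarrow{\widehat{prog}} \mathrm{enQ}_{\hat{q}}(\ldots, \eta)$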
Because these are the only two ways anything is enQed, (a call to ProcessQueueElement from Line 18
or from Line 24) there is a 1-1 correspondence between the set of all BCAST operations by all processes
in the specification level, and the set of all enQs for each process, p̂ in the implementation. Therefore,
each process eventually enQs the same set of queue-elements. Furthermore, each deQ operation by p̂
corresponds to exactly one DELIVER operation by p (see the code for DELIVER). Since only enQed messages
can be deQed, it only remains to show that every queue-element , say η, that is enQed is eventually deQed.
By Lemma 7.4, for all processes q̂, r̂ ∈ P̂ , eventually Tq̂[r̂] becomes and remains greater than or equal
to η.ts.
Therefore, eventually every priority-enQed queue-element with label l ≠ ⊥ will forever satisfy the
timestamp part of the CanExtract predicate. Furthermore, if some priorityQ [l] contains a queue-element
that satisfies this timestamp part of CanExtract, then the highest priority queue-element in priorityQ [l]
does, because priority decreases with increasing timestamp.
We now show that eventually η will also forever satisfy the counter part of the CanExtract or CanDequeue
requirement. Let Sp̂ be the set of queue-elements, γ, in p̂’s collection of queues, such that either 1) γ has
label l ≠ ⊥, has highest priority in priorityQ [l ], and satisfies the timestamp part of the CanExtract
requirement, or 2) γ has no label and is at the head of its fifoQ . We just established that this set cannot remain
empty. Let qep̂ be the queue-element in Sp̂ with the least (ts, ŝource) pair when it is not empty.
Since ŝource sends messages in order of increasing timestamp, and channels are fifo, unlabeled queue-
elements from ŝource enter fifoQ [ŝource] in order of increasing timestamp. Also, each priorityQ is ordered
by increasing timestamp. So every other message from ŝource with timestamp smaller than ts must have
been delivered, implying counter [source ] + 1 must be equal to qep̂.counter. Therefore, either qep̂ has a
label l ≠ ⊥ and satisfies CanExtract or qep̂ has label ⊥ and satisfies CanDequeue.
Thus, provided only a finite number of messages are BCAST (or, in long-lived computations, given a weak
fairness constraint), every queue-element that is enQed will eventually be deQed. Therefore, BCAST(m, l) ∈
OC if and only if DELIVER()m,l ∈ OC |p for every process p.
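Lemma 7.6. $\forall p \in P : \mathrm{Extends}[O|p, \xrightarrow{L_p}, \xrightarrow{prog}]$
Proof. Let $o_1, o_2 \in O|p$ such that $o_1 \xrightarrow{prog} o_2$. For $o \in \{o_1, o_2\}$, $\mathrm{TS}^{POB(L)}_{NW}(o).op$ denotes the operation in the
transformation of $o$ so that $\mathrm{lift}(\mathrm{TS}^{POB(L)}_{NW}(o).op) = o$. The proof diagram relates the two levels as follows:
  $o_1 \xrightarrow{prog} o_2$ at the specification level;
  $\mathrm{TS}^{POB(L)}_{NW}(o_1) \xrightarrow{\widehat{prog}} \mathrm{TS}^{POB(L)}_{NW}(o_2)$ for the transformations of $o_1$ and $o_2$;
  $\mathrm{TS}^{POB(L)}_{NW}(o_1).op_1 \xrightarrow{\widehat{prog}} \mathrm{TS}^{POB(L)}_{NW}(o_2).op_2$ and hence, by 1, $\mathrm{TS}^{POB(L)}_{NW}(o_1).op_1 \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{TS}^{POB(L)}_{NW}(o_2).op_2$;
  $o_1 \xrightarrow{L_p} o_2$ by applying lift to both operations.
1. $\mathrm{Extends}[O|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}, \xrightarrow{\widehat{prog}}]$ by the definition of the network model.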
Lemma 7.7. $\forall p \in P : (O|p, \xrightarrow{L_p})$ is a valid total order.
Proof. L̂p̂ is valid. Hence, the modified L̂p̂ after step 1, formed by removing all operations on some subsets
of objects, is valid. After step 2, the sequence remains valid because removing some send and recv
operations maintains the required validity property: a sequence of message operations is valid if it contains
at most one send and recv of any message. After step 3, the subsequence of Short(L̂p̂) consisting of
variables and network operations is valid. However, the subsequence of Short(L̂p̂) consisting of queue
operations contains only deQp̂ operations and is not valid. We now show that validity is restored in Lp =
Lift(Short(L̂p̂)).
The subsequence of Lp consisting of only variables is valid since lift() is essentially an identity map
for operations on variables. It remains to show validity for BCASTp and DELIVERp operations. Recall that a
sequence of BCASTp and DELIVERp operations is valid if
1. DELIVERp does not precede its corresponding BCASTp, and
2. no specific DELIVERp occurs more than once.
The proof of Lemma 7.5 showed that each update is DELIVERed exactly once in each Lp, establishing (2).
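Let $\mathrm{send}(\hat{p}, \hat{p}, [\text{LOCAL-BROADCAST-REQUEST}, u, l])$ and $\mathrm{deQ}_{\hat{p}}(queue_l)_{[u,ts,c,\hat{p}]}$ be operations in $O_{\hat{C}}|\hat{p}$. The next
diagram establishes (1). It starts from the chain
  $\mathrm{send}(\hat{p}, \hat{p}, [\text{LOCAL-BROADCAST-REQUEST}, u, l]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{p},\hat{p},[\text{LOCAL-BROADCAST-REQUEST},u,l]} \xrightarrow{\widehat{prog}} \text{priority-enQ}_{\hat{p}}(queue_l, [\text{ORD-MSG},u,l,ts,c,\hat{p}]) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{extractmin}_{\hat{p}}(queue_l)_{[\text{ORD-MSG},u,l,ts,c,\hat{p}]}$,
which by 1 gives
  $\mathrm{send}(\hat{p}, \hat{p}, [\text{LOCAL-BROADCAST-REQUEST}, u, l]) \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{extractmin}_{\hat{p}}(queue_l)_{[\text{ORD-MSG},u,l,ts,c,\hat{p}]}$,
and hence by 2, applying lift to both operations,
  $\mathrm{BCAST}_p(u, l) \xrightarrow{L_p} \mathrm{DELIVER}_p()_{u,l}$.
1. By definition of the network model, $\mathrm{Extends}[O|p, \xrightarrow{\widehat{HappensBefore}}, \xrightarrow{\widehat{prog}} \cup \xrightarrow{\widehat{MessageOrder}}]$ and
$\mathrm{Extends}[O|p, \xrightarrow{\hat{L}_{\hat{p}}}, \xrightarrow{\widehat{HappensBefore}}]$.
2. By construction of $\xrightarrow{L_p}$ from the definition of lift.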
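Lemma 7.8. $\forall p \in P : \mathrm{Extends}[O|p, \xrightarrow{L_p}, \xrightarrow{delOrder_C}]$
Proof. Let $o_1, o_2 \in O|p$ such that $o_1 \xrightarrow{delOrder} o_2$. Then, by definition of delOrder, $o_1 = \mathrm{DELIVER}_p()_{u_1,l_1}$,
$o_2 = \mathrm{DELIVER}_p()_{u_2,l_2}$ and there is a process q such that $\mathrm{BCAST}_q(u_1, l_1) \xrightarrow{prog} \mathrm{BCAST}_q(u_2, l_2)$.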
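The proof diagram follows the two broadcasts from q̂ to p̂. In each step $i \in \{1, 2\}$:
  $\mathrm{TS}^{POB(L)}_{NW}(\mathrm{BCAST}_q(u_1, l_1)) \xrightarrow{\widehat{prog}} \mathrm{TS}^{POB(L)}_{NW}(\mathrm{BCAST}_q(u_2, l_2))$, so $\mathrm{send}(\hat{q}, \hat{q}, [\text{LOCAL-BROADCAST-REQUEST}, u_1, l_1]) \xrightarrow{\widehat{prog}} \mathrm{send}(\hat{q}, \hat{q}, [\text{LOCAL-BROADCAST-REQUEST}, u_2, l_2])$;
  $\mathrm{send}(\hat{q}, \hat{q}, [\text{LOCAL-BROADCAST-REQUEST}, u_i, l_i]) \xrightarrow{\widehat{MessageOrder}} \mathrm{HandleMessage}.\mathrm{recv}()_{\hat{q},\hat{q},[\ldots,u_i,l_i]}$, and $\mathrm{recv}()_{\hat{q},\hat{q},[\ldots,u_1,l_1]} \xrightarrow{\widehat{FifoChannel}} \mathrm{recv}()_{\hat{q},\hat{q},[\ldots,u_2,l_2]}$;
  $\mathrm{recv}()_{\hat{q},\hat{q},[\ldots,u_i,l_i]} \xrightarrow{\widehat{prog}} \mathrm{write}(\text{local-counter}, x_i) \xrightarrow{\widehat{prog}} \mathrm{send}(\hat{q}, \hat{p}, [\text{ORD-MSG}, l_i, [u_i,\cdot,x_i,\hat{q}]])$;
  $\mathrm{send}(\hat{q}, \hat{p}, [\text{ORD-MSG}, l_1, [u_1,\cdot,x_1,\hat{q}]]) \xrightarrow{\widehat{prog}} \mathrm{recv}()_{\hat{q},\hat{q},[\ldots,u_2,l_2]}$, and so $\mathrm{send}(\hat{q}, \hat{p}, [\text{ORD-MSG}, l_1, [u_1,\cdot,x_1,\hat{q}]]) \xrightarrow{\widehat{prog}} \mathrm{send}(\hat{q}, \hat{p}, [\text{ORD-MSG}, l_2, [u_2,\cdot,x_2,\hat{q}]])$;
  $\mathrm{send}(\hat{q}, \hat{p}, [\text{ORD-MSG}, l_i, [u_i,\cdot,x_i,\hat{q}]]) \xrightarrow{\widehat{MessageOrder}} \mathrm{recv}()_{\hat{q},\hat{p},[\text{ORD-MSG},l_i,[u_i,\cdot,x_i,\hat{q}]]}$, and $\mathrm{recv}()_{\hat{q},\hat{p},[\text{ORD-MSG},l_1,[u_1,\cdot,x_1,\hat{q}]]} \xrightarrow{\widehat{FifoChannel}} \mathrm{recv}()_{\hat{q},\hat{p},[\text{ORD-MSG},l_2,[u_2,\cdot,x_2,\hat{q}]]}$;
  $\mathrm{recv}()_{\hat{q},\hat{p},[\text{ORD-MSG},l_i,[u_i,\cdot,x_i,\hat{q}]]} \xrightarrow{\widehat{prog}} \mathrm{enQ}(queue_{l_i}, [u_i,\cdot,x_i,\hat{q}])$, and $\mathrm{enQ}(queue_{l_1}, [u_1,\cdot,x_1,\hat{q}]) \xrightarrow{\widehat{prog}} \mathrm{recv}()_{\hat{q},\hat{p},[\text{ORD-MSG},l_2,[u_2,\cdot,x_2,\hat{q}]]}$.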
Each write to local-counter increments it by 1, so 0 < x1 < x2.
To DELIVER each of these updates, $\mathrm{deQ}_{\hat{p}}(queue_{l_1})_{[u_1,\cdot,x_1,\hat{q}]}$ and $\mathrm{deQ}_{\hat{p}}(queue_{l_2})_{[u_2,\cdot,x_2,\hat{q}]}$ must both be in Ô|p̂. For i ∈ {1, 2},
if label li ≠ ⊥, then p̂ can $\mathrm{extractmin}(priorityQ[l_i])_{[u_i,\cdot,x_i,\hat{q}]}$ only if its call to CanExtract(priorityQ [li]) returns
TRUE, where [ui, ·, xi, q̂] is the queue-element at the head of priorityQ [li] and xi is exactly one bigger than
the value stored by p̂ for counter[q̂] (see line 30 of CanExtract). Similarly, for i ∈ {1, 2}, if label li = ⊥,
then p̂ can $\text{fifo-deQ}(fifoQ[\hat{q}])_{[u_i,\cdot,x_i,\hat{q}]}$ only if its call to CanDequeue(fifoQ [q̂]) returns TRUE, where [ui, ·, xi, q̂]
is the queue-element at the head of fifoQ [q̂] and xi is exactly one bigger than the value stored by p̂ for
counter[q̂] (see line 24 of CanDequeue).
Each process’ counter[src] starts at 0 and is incremented by 1 if and only if a message from src is
deQed (see line 11 of DELIVER()). Since x1 < x2, p̂ must deQ u1 before u2:
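  $\mathrm{deQ}_{\hat{p}}(queue_{l_1})_{[u_1,\cdot,x_1,\hat{q}]} \xrightarrow{\widehat{prog}} \mathrm{deQ}_{\hat{p}}(queue_{l_2})_{[u_2,\cdot,x_2,\hat{q}]}$;
  hence, by 1, $\mathrm{deQ}_{\hat{p}}(queue_{l_1})_{[u_1,\cdot,x_1,\hat{q}]} \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{deQ}_{\hat{p}}(queue_{l_2})_{[u_2,\cdot,x_2,\hat{q}]}$;
  hence, applying lift, $\mathrm{DELIVER}_p()_{u_1,l_1} \xrightarrow{L_p} \mathrm{DELIVER}_p()_{u_2,l_2}$.
1. $\mathrm{Extends}[O|\hat{p}, \xrightarrow{\hat{L}_{\hat{p}}}, \xrightarrow{\widehat{prog}}]$
This proves that $\mathrm{Extends}[O|p, \xrightarrow{L_p}, \xrightarrow{delOrder}]$.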
Lemma 7.9. $\forall p, q \in P, l \in L : \mathrm{Agree}[O|delivers(l), \xrightarrow{L_p}, \xrightarrow{L_q}]$
Proof. For each process p there is a one-to-one correspondence between the set of DELIVER()s by process
p of updates with label l and the set of extractmins by process p̂ of queue-elements from priorityQ [l].
Let qe1 = [u1, ts1, ·, q̂] and qe2 = [u2, ts2, ·, r̂] be two such queue-elements with label l, where (ts1, q) <
(ts2, r). Then ts1 ≤ ts2, so by Lemma 7.3, for all p̂ ∈ P̂ ,
$\text{priority-enQ}_{\hat{p}}(priorityQ[l], qe_1) \xrightarrow{\widehat{prog}} \mathrm{extractmin}_{\hat{p}}(priorityQ[l])_{qe_2}$.
Therefore, by the definition of the priority queue (queue-elements are ordered
lexicographically by (timestamp, source) pair):
  $\mathrm{extractmin}_{\hat{p}}(priorityQ[l])_{qe_1} \xrightarrow{\widehat{prog}} \mathrm{extractmin}_{\hat{p}}(priorityQ[l])_{qe_2}$;
  hence $\mathrm{extractmin}_{\hat{p}}(priorityQ[l])_{qe_1} \xrightarrow{\hat{L}_{\hat{p}}} \mathrm{extractmin}_{\hat{p}}(priorityQ[l])_{qe_2}$;
  hence, applying lift, $\mathrm{DELIVER}_p()_{u_1,l} \xrightarrow{L_p} \mathrm{DELIVER}_p()_{u_2,l}$.
Hence, for each label l, and all processes p, q, the orders $(O|delivers(l), \xrightarrow{L_p})$ and $(O|delivers(l), \xrightarrow{L_q})$
agree.
8 Summary, Open Questions and Future Work
This paper introduced partition consistency, a parameterized memory consistency model, from which other
known models can be instantiated. Four implementations of partition consistency on a message-passing
network of multithreaded nodes were also developed and proved correct. All implementations are structured
with a middle-level of abstraction which serves to modularize the implementations and simplify our proofs.
The implementations are based on Attiya and Welch’s slow-write/fast-read and fast-write/slow-read methods
[9]. Both the token-based and queue-based variants are achieved by extending Attiya and Welch’s total order
broadcast [9] to a partial order broadcast.
All four implementations were proven correct using a unified framework. Such unified descriptions of
memory consistency models at different levels of abstraction and the associated proof techniques provide
more confidence in proofs that are otherwise tedious, lengthy and ad hoc.
Our proofs assume that the specification-level computations always terminate. Extending these proofs
to long-lived computations is not difficult, but it is tedious. It would be useful to have a general technique
to “reduce” the long-lived case to finite cases. We also suggest that the framework, the proof set-up and
the diagrammatic proof descriptions used in this paper could be used to establish the correctness of other
memory consistency models for various multiprocess machines, or networks or languages (for example,
C++).
Let us call a correct implementation exact if every computation of the specification level is an interpre-
tation of some computation of the target level. Our implementations in this paper are correct but not exact;
there are computations that satisfy partition consistency that could not happen in our implementations. For
example, abstract memory consistency models such as P-RAM and PC-G allow a kind of cyclic causality,
such as the computation:
p : READ(x)1 , WRITE(y, 2)
q : READ(y)2 , WRITE(x, 1)
This computation contains a cycle. The first process must read the value written by the second process
before writing but the second process must read the value written by the first process before writing. This
problem can be overcome by adding a causality constraint to the memory consistency definitions. Our im-
plementations prohibit such cyclic computations. Hence, our implementations are stronger than the memory
consistency models they implement — the specifications allow such computations, but our implementations
do not. Though this computation may seem impossible in actual implementations, it could conceivably be
possible if there is a prediction system in place. Our proof method could still be used with such a system.
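The cycle in the computation above can be checked mechanically. The sketch below is one possible reading of the causality constraint, stated here as an assumption: it builds a graph whose edges are program order together with a writes-into edge from each write to every read that returns its value, and reports whether that graph has a cycle. The encoding of operations as `("R", var, value)` and `("W", var, value)` tuples, the requirement that each value is written at most once per variable, and the name `has_causal_cycle` are all hypothetical.

```python
def has_causal_cycle(processes):
    """processes: dict mapping a process name to its operations in program order.

    An operation is ("R", var, value) or ("W", var, value). Assumes each value
    is written at most once per variable; reads of unwritten (initial) values
    get no writes-into edge.
    """
    nodes = [(p, i) for p, ops in processes.items() for i in range(len(ops))]
    succ = {node: [] for node in nodes}
    writer = {}
    for p, ops in processes.items():
        for i, (kind, var, val) in enumerate(ops):
            if i + 1 < len(ops):
                succ[(p, i)].append((p, i + 1))          # program order edge
            if kind == "W":
                writer[(var, val)] = (p, i)
    for p, ops in processes.items():
        for i, (kind, var, val) in enumerate(ops):
            if kind == "R" and (var, val) in writer:
                succ[writer[(var, val)]].append((p, i))  # writes-into edge

    state = {node: 0 for node in nodes}   # 0 = unvisited, 1 = on stack, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in succ[node]:
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[node] == 0 and visit(node) for node in nodes)

cyclic = {
    "p": [("R", "x", 1), ("W", "y", 2)],
    "q": [("R", "y", 2), ("W", "x", 1)],
}
acyclic = {
    "p": [("R", "x", 1), ("W", "y", 2)],
    "q": [("R", "y", 0), ("W", "x", 1)],   # q reads the initial value of y
}
```

With this encoding, `has_causal_cycle(cyclic)` is true for the computation in the text, while the variant in which q reads the initial value of y has no cycle.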
As a second example, consider the computation:
p : WRITE(x, 1), READ(y)3 , READ(x)1
q : READ(x)1 , WRITE(x, 2), WRITE(y, 3)
This computation is possible in a P-RAM implementation that broadcasts to itself, provided there is no
guarantee that a process applies its own write before any other process applies that write. The first process
broadcasts WRITE(x, 1) which is received and applied by the second process. The second process then
broadcasts WRITE(x, 2) and WRITE(y, 3). The first process receives WRITE(x, 2) before its own WRITE(x, 1)
and so overwrites the 2 with a 1, even though WRITE(x, 1) “caused” WRITE(x, 2). This computation also
could not happen in the implementations in this paper. This shows that our implementations are not exact,
and that the simplest causality constraint added to the memory consistency definition is insufficient to make
it exact. Whether or not there is a simple strengthening of the partition consistency predicate to make our
current implementations exact remains a question for future research.
A related issue to exactness is optimality. We believe that our implementations use minimal synchro-
nization in the following sense. For every synchronization that is added by the transformation, there exists
a program whose transformation would create computations that do not satisfy partition consistency if that
synchronization is removed. Confirming this intuition is beyond the scope of this paper.
Since the transformations in this paper are general, they are not optimal for all programs. Transforming
individual programs may lead to more efficient implementations. Hence, another approach, which we have not
followed but which others pursue (see Section 2), aims to determine for each program what synchronization
is necessary and sufficient to preserve correctness on the target machine.
A different direction that could complement this research is to assess the performance gains of
partition consistency instantiations over sequential consistency. Our preliminary experiments on Westgrid’s
128-node “matrix” cluster [26] are inconclusive, but some of these instantiations, particularly the weak
sequential consistency model, show potential to outperform sequential consistency. This complementary
study will be the subject of future work.
9 Acknowledgments
This research was supported by the Natural Sciences and Engineering Research Council of Canada through
discovery grant number 41900-07. Two anonymous referees provided insightful comments that helped us
improve this submission.
References
[1] Luca Aceto, Anna Ingólfsdóttir, Kim Guldstrand Larsen, and Jiri Srba. Reactive Systems: Modelling,
Specification and Verification. Cambridge University Press, August 2007.
[2] Sarita V. Adve. Using information from the programmer to implement system optimizations without
violating sequential consistency. Technical Report ECE 9603, Department of Electrical and Computer
Engineering, Rice University, March 1996.
[3] Sarita V. Adve. Data races are evil with no exceptions: technical perspective. Commun. ACM,
53(11):84, 2010.
[4] Divyakant Agrawal, Manhoi Choy, Hong Va Leong, and Ambuj K. Singh. Investigating weak memo-
ries using Maya. In Proceedings of the 3rd International Symposium on High Performance Distributed
Computing, pages 123–130, 1994.
[5] Mustaque Ahamad, Rida Bazzi, Ranjit John, Prince Kohli, and Gil Neiger. The power of processor
consistency. In Proceedings of the 5th International Symposium on Parallel Algorithms and Architec-
tures, pages 251–260, June 1993. Technical Report GIT-CC-92/34, College of Computing, Georgia
Institute of Technology.
[6] David Aspinall and Jaroslav Ševčík. Formalising Java’s data race free guarantee. In Proceedings of
the 20th International Conference on Theorem Proving in Higher Order Logics, pages 22–37, 2007.
[7] Hagit Attiya and Roy Friedman. Programming DEC-Alpha based multiprocessors the easy way. In
Proceedings of the 6th International Symposium on Parallel Algorithms and Architectures, pages 157–
166, 1994. Technical Report LPCR 9411, Computer Science Department, Technion.
[8] Hagit Attiya and Jennifer Welch. Sequential consistency versus linearizability. ACM Transactions on
Computer Systems, 12(2):91–122, 1994.
[9] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations, and Advanced
Topics. Wiley-Interscience, 2004.
[10] Roland Backhouse. Program Construction: Calculating Implementations from Specifications. John
Wiley and Sons, Inc., New York, NY, USA, 2003.
[11] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. Mathematizing C++ concurrency.
In Thomas Ball and Mooly Sagiv, editors, POPL, pages 55–66. ACM, 2011.
[12] Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model. SIGPLAN
Not., 43(6):68–78, June 2008.
[13] Stephen Brookes. A semantics for concurrent separation logic. Theoretical Computer Science, 375(1-
3):227–270, 2007.
[14] Jerzy Brzezinski and Michal Szychowiak. Low cost coherence protocol for DSM systems with pro-
cessor consistency. In Adnan Yazici and Cevat Sener, editors, ISCIS, volume 2869 of Lecture Notes in
Computer Science, pages 916–925. Springer, 2003.
[15] Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin. CheckFence: checking consistency of
concurrent data types on relaxed memory models. In Jeanne Ferrante and Kathryn S. McKinley,
editors, PLDI, pages 12–21. ACM, 2007.
[16] Vicent Cholvi, Antonio Fernández, Ernesto Jiménez, and Michel Raynal. A methodological construction
of an efficient sequential consistency protocol. In Proceedings of the 3rd IEEE International
Symposium on Network Computing and Applications, pages 141–148. IEEE Computer Society, 2004.
[17] Francisco Corella, Janice M. Stone, and Charles Barton. A formal specification of the PowerPC shared
memory architecture. Technical Report RC18638, IBM, 1994.
[18] Intel Corporation. Intel 64 memory ordering white paper. Technical Report SKU:318147-001, Intel
Corporation, 2007.
[19] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, volume 3A: System
Programming Guide, Part 1. 2008. SKU:253668.
[20] Compaq Computer Corporation. The Alpha Architecture Handbook. 1998. Order number:
EC-QD2KC-TE.
[21] W. H. J. Feijen and A. J. M. van Gasteren. On a Method of Multiprogramming. Springer-Verlag New
York, Inc., New York, NY, USA, 1999.
[22] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Revision to memory consistency and event
ordering in scalable shared-memory multiprocessors. Technical Report CSL-TR-93-568, Computer
Systems Laboratory, Stanford University, April 1993.
[23] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip B. Gibbons, Anoop Gupta, and John
Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In
Proc. 17th Int’l Symp. on Computer Architecture, pages 15–26, May 1990.
[24] Jay L. Gischer. The equational theory of pomsets. Theoretical Computer Science, 61:199–224, 1988.
[25] James Goodman. Cache consistency and sequential consistency. Technical Report 61, IEEE Scalable
Coherent Interface Working Group, March 1989.
[26] Western Canada Research Grid. Westgrid website. http://www.westgrid.ca.
[27] Maurice Herlihy and Jeannette Wing. Linearizability: A correctness condition for concurrent objects.
ACM Trans. on Programming Languages and Systems, 12(3):463–492, July 1990.
[28] Lisa Higham and LillAnne Jackson. Translating between Itanium and Sparc memory consistency
models. In Phillip B. Gibbons and Uzi Vishkin, editors, SPAA, pages 170–179. ACM, 2006.
[29] Lisa Higham, LillAnne Jackson, and Jalal Kawash. Capturing register and control dependence in
memory consistency models with applications to the Itanium architecture. In Proceedings of the 20th
International Symposium on Distributed Computing, September 2006.
[30] Lisa Higham, LillAnne Jackson, and Jalal Kawash. Programmer-centric conditions for Itanium mem-
ory consistency. In Proceedings of the 8th International Conference on Distributed Computing and
Networking, December 2006.
[31] Lisa Higham, LillAnne Jackson, and Jalal Kawash. Specifying memory consistency of write buffer
multiprocessors. ACM Transactions on Computer Systems, 25(1), 2007.
[32] Lisa Higham, LillAnne Jackson, and Jalal Kawash. What is Itanium memory consistency from the
programmer’s point of view? Electr. Notes Theor. Comput. Sci., 174(9):63–84, 2007.
[33] Lisa Higham and Jalal Kawash. Critical sections and producer/consumer queues in weak memory
systems. In Proc. 1997 Int’l Symp. on Parallel Architectures, Algorithms, and Networks, pages 56–63,
December 1997.
[34] Lisa Higham and Jalal Kawash. Java: Memory consistency and process coordination (extended ab-
stract). In Proc. 12th Int’l Symp. on Distributed Computing, Lecture Notes in Computer Science volume
1499, pages 201–215, September 1998.
[35] Lisa Higham and Jalal Kawash. Tight bounds for critical sections in processor consistent platforms.
IEEE Transactions on Parallel and Distributed Systems, 17(10):1072–1083, 2006.
[36] Lisa Higham and Jalal Kawash. Implementing sequentially consistent programs on processor consis-
tent platforms. Journal of Parallel and Distributed Computing, 68(4):488–500, April 2008.
[37] C.A.R. Hoare. Communicating Sequential Processes (Prentice-Hall International Series in Computer
Science). Prentice Hall, April 1985.
[38] Javid Huseynov. Distributed shared memory home pages. http://www.ics.uci.edu/~javid/dsm.html.
[39] Intel Corporation. Intel Itanium architecture software developer’s manual, volume 2: System architec-
ture. http://www.intel.com/, Oct 2002.
[40] Peter J. Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed
shared memory on standard workstations and operating systems. In USENIX Winter, pages 115–132,
1994.
[41] Michael Kuperstein, Martin T. Vechev, and Eran Yahav. Automatic inference of memory fences. In
Roderick Bloem and Natasha Sharygina, editors, FMCAD, pages 111–119. IEEE, 2010.
[42] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of
the ACM, 21(7):558–565, July 1978.
[43] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess pro-
grams. IEEE Transactions on Computers, C-28(9):690–691, September 1979.
[44] Leslie Lamport. On interprocess communication (parts I and II). Distributed Computing, 1(2):77–85
and 86–101, 1986.
[45] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software
Engineers. Addison-Wesley, July 2002.
[46] Richard J. Lipton and Jonathan S. Sandberg. PRAM: A scalable shared memory. Technical Report
180-88, Department of Computer Science, Princeton University, September 1988.
[47] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[48] Nancy A. Lynch. Distributed Algorithms (The Morgan Kaufmann Series in Data Management Sys-
tems). Morgan Kaufmann, 1st edition, April 1997.
[49] Jeremy Manson, William Pugh, and Sarita Adve. The Java memory model. In Jens Palsberg and
Martín Abadi, editors, POPL, pages 378–391. ACM, 2005.
[50] Robin Milner. Communication and Concurrency (Prentice Hall International Series in Computer
Science). Prentice Hall, 1st edition, September 1995.
[51] Olaf Müller. I/O automata and beyond: Temporal logic and abstraction in Isabelle. In Jim Grundy and
Malcolm C. Newey, editors, Proceedings of the 11th International Conference on Theorem Proving in
Higher Order Logics, volume 1479 of Lecture Notes in Computer Science, pages 331–348. Springer,
September 1998.
[52] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory model: x86-TSO. In Stefan
Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel, editors, TPHOLs, volume 5674 of
Lecture Notes in Computer Science, pages 391–407. Springer, 2009.
[53] Vaughan R. Pratt. Modeling concurrency with geometry. In Proceedings of the 18th Annual ACM
Symposium on Principles of Programming Languages, pages 311–322, January 1991.
[54] John C. Reynolds. Separation logic: A logic for shared mutable data structures. In Proceedings of the
17th IEEE Symposium on Logic in Computer Science, pages 55–74. IEEE Computer Society, 2002.
[55] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. Understanding
POWER multiprocessors. In Mary W. Hall and David A. Padua, editors, PLDI, pages 175–186. ACM,
2011.
[56] Susmit Sarkar, Peter Sewell, Francesco Zappa Nardelli, Scott Owens, Tom Ridge, Thomas Braibant,
Magnus O. Myreen, and Jade Alglave. The semantics of x86-CC multiprocessor machine code. In
Zhong Shao and Benjamin C. Pierce, editors, POPL, pages 379–391. ACM, 2009.
[57] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-
TSO: a rigorous and usable programmer’s model for x86 multiprocessors. Commun. ACM, 53(7):89–
97, 2010.
[58] Dennis Shasha and Marc Snir. Efficient and correct execution of parallel programs that share memory.
ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.
[59] Robert C. Steinke and Gary J. Nutt. A unified theory of shared memory consistency. Journal of the
ACM, 51(5):800–849, 2004.
[60] Nathaly Verwaal. Ambiguous memory consistency models. Master’s thesis, Department of Computer
Science, The University of Calgary, 1998.
