Deconstructing Queue-Based Mutual Exclusion by Golab, Wojciech
arXiv:1310.7397v2 [cs.DC] 30 Oct 2013
Deconstructing Queue-Based Mutual Exclusion
Wojciech Golab∗†
wgolab@uwaterloo.ca
Hewlett-Packard Labs Technical Report HPL-2012-100
May 6, 2012
Abstract
We formulate a modular approach to the design and analysis of a particular
class of mutual exclusion algorithms for shared memory multiprocessor sys-
tems. Specifically, we consider algorithms that organize waiting processes into
a queue. Such algorithms can achieve O(1) remote memory reference (RMR)
complexity, which minimizes (asymptotically) the amount of traffic through the
processor-memory interconnect. We first describe a generic mutual exclusion
algorithm that relies on a linearizable implementation of a particular queue-like
data structure that we call MutexQueue. Next, we show two implementations of
MutexQueue using O(1) RMRs per operation based on synchronization prim-
itives commonly available in multiprocessors. These implementations follow
closely the queuing code embedded in previously published mutual exclusion
algorithms. We provide rigorous correctness proofs and RMR complexity anal-
yses of the algorithms we present.
1 Introduction
Synchronization is a fundamental challenge in asynchronous shared memory multi-
processor systems, where processes executing in parallel must exercise caution while
accessing shared data structures. Unless concurrent access to such data structures
is directly supported in hardware, careful coordination is necessary at the software
level to prevent corruption of the data structure and ensure that processes executing
on different processors reach consistent views of the data. The dominant approaches
to such coordination are mutual exclusion and non-blocking synchronization.
Mutual exclusion (ME) was formulated by Dijkstra [9], and later formalized by
Lamport [17, 18]. In this approach, processes take turns accessing the shared data
structure. The execution path of each process is modelled as a repeating sequence
of four sections, illustrated below in Figure 1. Access to the shared data structure is
confined to a special critical section (CS), to which the process must gain exclusive
access by executing the entry protocol. Similarly, an exit protocol is executed upon
completing the CS to signal that the CS can now be entered by another process.
Between executions of the CS, a process lives in the non-critical section (NCS). The
set of variables accessed by a process while in the CS or the NCS is disjoint from
the set of variables accessed while in the entry or exit protocol.

loop
    Non-Critical Section (NCS)
    Entry Protocol:
        Doorway (bounded)
        Waiting room
    Critical Section (CS)
    Exit Protocol
forever
Figure 1: Execution path of a process participating in mutual exclusion.

∗Research conducted during doctoral studies at the University of Toronto, Canada.
†Research partially supported by the Natural Sciences and Engineering Research Council
(NSERC) of Canada.
An execution by a process of the entry protocol, CS and exit protocol is referred
to as a passage (through the ME algorithm). The entry section is sometimes divided
into a doorway, where a process enters a queue by executing a bounded number of
steps, and a waiting room, where it waits for its predecessor in the queue to exit the
CS. This type of algorithm is referred to as first-come first-served (FCFS). Lockout
freedom (also referred to as starvation freedom) is a progress property whereby every process
that begins the entry protocol eventually enters the CS, provided that no process
halts outside the NCS. Finally, deadlock freedom is a weaker property that guarantees
the same but only provided that no process passes through the CS infinitely often,
and is considered the weakest progress property required for correctness of a mutual
exclusion algorithm.
In contrast to mutual exclusion, non-blocking synchronization requires some
measure of progress regardless of the rates at which other processes are executing.
For example, in wait-free synchronization [13], a process must complete each access
to the shared data structure in a finite number of its own steps. The key idea
behind universal constructions of wait-free data structures is that faster processes
assist slower ones in performing updates. To that end, processes exploit hardware
synchronization primitives to agree on the order in which updates are applied, and
hence on the state of the shared data.
Mutual exclusion and wait-freedom have complementary characteristics. ME is
a blocking approach since a fast process can spend an unbounded amount of time
in a busy-wait loop, which typically involves repeatedly testing or spinning on one
variable, while waiting for another process to complete the CS and then write that
variable. In contrast, in a wait-free algorithm a process ensures its own progress even
if all others halt at an arbitrary point in their execution. In that sense, wait-freedom
is a stronger progress property than the one underlying mutual exclusion. Not
surprisingly, there exist shared object types for which wait-free implementations are
provably more costly than blocking ones. For example, any wait-free N-process
implementation of fetch-and-increment using atomic read/write registers (subsequently
referred to simply as registers) and fetch-and-store requires Ω(log N) remote memory
references (RMRs, discussed below) [15]. In contrast, there is a blocking
implementation of fetch-and-increment from registers and fetch-and-store using only O(1)
RMRs. In light of its lower cost, mutual exclusion remains the dominant approach
to synchronization in practice.
Figure 2: Shared memory architectures: DSM (left) and CC (right).
1.1 Time Complexity of Mutual Exclusion Algorithms
Analyzing the time complexity of ME algorithms requires some awareness of the
shared memory hardware architecture as different memory operations may incur
significantly different latencies. This is due to the growing disparity between proces-
sor speed and memory access speed, which motivates multiprocessor designs based
on the paradigms of non-uniform memory access (NUMA) and/or caching. Two
important classes of such architectures are illustrated in Figure 2 [23, 3]. In a Dis-
tributed Shared Memory (DSM) machine, each memory module can be accessed
locally by some processor without involving the processor-to-memory interconnect,
thus reducing much of the latency. Processors in cache-coherent (CC) machines, on
the other hand, maintain local copies of data inside caches, which are synchronized
by a coherence protocol. Thus, any shared memory location can become local at
runtime to any processor in the CC model.
In both the DSM and CC models, memory operations are classified as either
remote or local. This classification is straightforward in the DSM model as locality
is determined through static allocation of a variable in a particular memory module.
In contrast, in the CC model locality of a memory operation is determined by the
state of the processor’s cache, which depends on prior steps of the same processor
and possibly others, as well as by the type of memory access (e.g., read versus
write), which determines the behaviour of the coherence protocol. For our purposes,
we consider the following ideal behaviour in the CC model: after a processor reads
a variable, this variable is held in that processor’s cache and can be read locally
(i.e., without incurring an RMR) until another processor writes the same variable.
In both the DSM and CC models, we will assume for worst-case analysis that each
process runs on a distinct processor.
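The idealized CC behaviour just described can be made concrete with a small RMR counter. The following is a minimal sketch and not part of the formal model: the class name IdealCCModel is hypothetical, and it assumes that every write is charged as one RMR, a point the description above leaves open.

```python
from collections import defaultdict


class IdealCCModel:
    """Counts RMRs under the idealized cache-coherent behaviour (illustrative only)."""

    def __init__(self):
        # For each variable, the set of processors holding a valid cached copy.
        self.cached = defaultdict(set)
        # Number of RMRs charged to each processor so far.
        self.rmrs = defaultdict(int)

    def read(self, proc, var):
        if proc not in self.cached[var]:
            # Cache miss: the value is fetched over the interconnect.
            self.rmrs[proc] += 1
            self.cached[var].add(proc)
        # Otherwise the read is served from the local cache at no cost.

    def write(self, proc, var):
        # Assumption: every write is charged as one RMR.
        self.rmrs[proc] += 1
        # A write invalidates the copies held by all other processors.
        self.cached[var] &= {proc}


cc = IdealCCModel()
for _ in range(1000):
    cc.read(0, "flag")   # processor 0 spins on flag: one miss, then local reads
cc.write(1, "flag")      # processor 1 signals processor 0
cc.read(0, "flag")       # processor 0 re-reads after the invalidation
print(cc.rmrs[0], cc.rmrs[1])  # 2 1
```

In this sketch a busy-wait loop of any length costs the spinning processor two RMRs, which is the intuition behind local-spin algorithms in the CC model.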
Remote operations, referred to as remote memory references or RMRs, can be
orders of magnitude more costly than local ones. Consequently, RMR complexity
quantifies not only the overhead of accessing the processor-to-memory interconnect,
but also the main source of latency incurred while executing a mutual exclusion
algorithm. Mutual exclusion algorithms with bounded RMR complexity are referred
to as local-spin, and have been the focus of recent research [3] on shared memory
multiprocessors. In such an algorithm, all busy-waiting must be done by spinning
on locally accessible variables.
To obtain a more direct measure of time complexity, one can consider the over-
head of contention (for the processor-to-memory interconnect and shared memory
modules) in addition to RMRs. This overhead can be quantified by counting memory
stalls under the assumption that concurrent accesses to a common shared variable
are serialized [10]. In that case, the i’th process in the serialization order incurs i−1
memory stalls as it waits for its predecessors to complete their operations. The exact
overhead of contention depends on the shared memory architecture. Most notably,
in a bus-based system a snooping protocol makes it possible for multiple processes
to read a common shared variable simultaneously. In that case one counts memory
stalls for concurrent writes but not for concurrent reads.
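As a small worked example of the stall count, the sketch below tabulates the stalls incurred when k accesses to one variable are serialized. The function name memory_stalls is hypothetical, and the sketch assumes that all k accesses are of a kind that must be serialized (for instance, writes).

```python
def memory_stalls(num_accesses):
    """Stalls incurred by each of num_accesses serialized concurrent accesses.

    The i'th access in the serialization order (counting from 1) waits for
    its i - 1 predecessors.
    """
    return [i - 1 for i in range(1, num_accesses + 1)]


stalls = memory_stalls(4)
print(stalls, sum(stalls))  # [0, 1, 2, 3] 6
```

Summing over all k accesses gives k(k - 1)/2 stalls in total, so contention on a single variable grows quadratically with the number of concurrent accesses.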
Time complexity measures for mutual exclusion algorithms typically omit local
memory operations. Although local operations do have an impact on the overall
latency, a complexity measure that counts them is generally unbounded. Even if the CS is empty,
due to the assumption of asynchrony there is no bound on the time that a process
leaving the CS takes to execute the exit protocol and allow the next process to
proceed into the CS. Furthermore, in an algorithm using only registers, even the
first process to enter the CS may perform an unbounded number of steps unless
it is executing solo [1]. It is possible to circumvent this problem by defining time
complexity in terms of a virtual clock that ticks once for every interval of time in
which every process has been given sufficient time to perform one operation on a
shared memory object. The response time of a mutual exclusion algorithm is the
number of such clock ticks from the time a process leaves the NCS to the time it
enters the CS [6].
1.2 Contributions of This Paper
Consider the following simple and intuitively appealing idea for an FCFS mutual
exclusion algorithm: Processes wait in a queue to enter the CS. Only the head of
the queue may enter the CS. A process leaving the NCS adds itself to the end of
the queue and, if it is not the head of the queue, it waits by repeatedly reading a
local spin variable. A process leaving the CS removes itself from the (head of the)
queue. It then writes a shared variable to signal its successor (now the new head
of the queue), perhaps after checking if such a process exists, to stop waiting and
proceed into the CS.
Clearly, race conditions can arise when a process contending for entry to the CS
checks whether it is the head of the queue (perhaps as it does so another process is
about to enter the queue), and when a process leaving the CS checks whether there
is a successor in the queue (perhaps as it does so another process is about to become
its successor). Handling these race conditions properly, while relying on standard
synchronization primitives and using as few RMRs as possible, is a delicate task.
Several algorithms based on the above idea have appeared in the literature [4,
12, 23, 8, 22, 24, 19]. The common simple structure underlying all these algorithms,
however, is obscured by the intricate details of handling the race conditions described
above. Furthermore, to our knowledge, some of these algorithms have not been
proved correct.
In this paper we propose a modular approach to the design and analysis of such
algorithms. We first define a queue-like shared data structure, called MutexQueue.
This data structure allows a process to add itself to the end of the queue, query
whether it is the head of the queue, and remove itself from the head of the queue
(simultaneously determining the identity of its successor in the queue, if one exists).
We then present a very simple generic queue-based mutual exclusion algorithm along
the lines described above, that uses this data structure as a “black box”. We prove
the correctness of this algorithm based on the abstract properties of MutexQueue.
This algorithm uses only a constant number of RMRs, beyond what are needed to
implement the “black box” MutexQueue, and applies only a constant number of
operations on MutexQueue, per passage.
We then present two implementations of MutexQueue, both using only a con-
stant number of RMRs for each operation in the DSM and CC models. The first
uses registers and the fetch-and-increment primitive (which atomically increments a
shared memory word and returns its previous value) while the second uses registers
and the fetch-and-store primitive (which atomically assigns a new value to a shared
memory word and returns its previous value).
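The semantics of the two primitives can be sketched in software as follows. This is an emulation for illustration only: the class name Word is hypothetical, and a lock stands in for the atomicity that real hardware provides in a single instruction.

```python
import threading


class Word:
    """A shared memory word; a lock emulates hardware atomicity (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def fetch_and_increment(self):
        # Atomically add one and return the value held before the increment.
        with self._lock:
            old = self._value
            self._value = old + 1
            return old

    def fetch_and_store(self, new):
        # Atomically install `new` and return the value held before the store.
        with self._lock:
            old = self._value
            self._value = new
            return old


counter = Word(0)
tickets = []
threads = [
    threading.Thread(target=lambda: tickets.append(counter.fetch_and_increment()))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(tickets))  # [0, 1, 2, 3, 4, 5, 6, 7]

tail = Word(None)
print(tail.fetch_and_store("p0"), tail.fetch_and_store("p1"))  # None p0
```

Both primitives hand every caller a distinct piece of information about the order of invocations (a ticket number, or the previous occupant of the word), which is what allows processes to line up in a queue.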
The two implementations of MutexQueue are not novel: they are embedded
in previously published mutual exclusion algorithms; here, we have simply recast
them as implementations of the MutexQueue data structure. Specifically, the first
implementation of MutexQueue is based on a mutual exclusion algorithm due to Tom
Anderson [4], as subsequently modified by James Anderson and Yong-Jik Kim. The
second implementation of MutexQueue is based on a mutual exclusion algorithm
due to Craig [8]. To our knowledge, however, these mutual exclusion algorithms
have not been proved correct (see footnote 1). In this paper we give rigorous correctness proofs of
these algorithms (as implementations of MutexQueue).
The advantage of our modular approach is that it “factorizes” the common struc-
ture of some queue-based algorithms, in the form of the generic mutual exclusion
algorithm. The correctness of this common part need only be proved once. What
is left in each of these algorithms can be viewed as an implementation of the
MutexQueue data structure.
Our definition of the MutexQueue also sheds light on how exactly processes
coordinate access to the critical section in queue-based mutual exclusion algorithms.
For example, in such algorithms a process does not enter the queue and also discover
whether it became the head element in one atomic step. Rather, two atomic steps
are required, and are therefore represented by distinct MutexQueue operations. In
contrast, a process can exit the queue and discover its successor in one atomic
step. Surprisingly, sometimes a process can also exit the queue and discover no
other process, even though a successor does exist! In particular, this occurs if the
successor has entered the queue but has not yet queried the head element. Thus,
it is the latter step (i.e., querying the head) that makes a process “visible” to its
predecessor, and not the mere act of entering the queue.
2 Related Work
The RMR complexity of mutual exclusion algorithms is a function of the number
of processes, N. The best known upper bound on the worst-case RMR complexity
per passage of algorithms based on (atomic) read/write registers only is O(log N)
[25, 16]. This bound is tight [5]. The same tight bound holds for the class of mutual
exclusion algorithms that in addition to registers use compare-and-swap (CAS) or
load-linked/store-conditional (LL/SC) – primitives that conditionally change the
value of a shared memory location [11].
Using synchronization primitives such as fetch-and-store (i.e., swap between
shared memory and a private register) and fetch-and-increment, it is possible to
devise mutual exclusion algorithms with worst-case RMR complexity of only O(1)
[4, 12, 23, 8, 22, 24, 19] (see footnote 2). The properties of these algorithms are summarized in
Table 1. All of these algorithms are based on the concept of a process queue, which
determines the order in which processes enter the CS and enables efficient signaling
between processes that enter the CS consecutively. Thus, in addition to mutual
exclusion and lockout freedom, these algorithms also satisfy FCFS.
Footnote 1: A variant of Craig’s algorithm [8] is proved correct in [19]. This variant is intuitively simpler,
but uses an array of length 2N instead of N + 1 to encode the queue of processes waiting to enter
the critical section.
Footnote 2: The original algorithm of T. Anderson [4] uses a constant number of RMRs in the CC model but
is not local-spin in the DSM model. A constant-RMR DSM variant using the same synchronization
primitives can be obtained by applying the transformation described in [20] or in footnote 7 of [2].
Rhee’s algorithm [24] is targeted at a variant of the DSM model with weaker memory consistency,
where read/write operations executed by one processor may appear to take effect in a different
order to another processor due to buffering of writes. In this model, a special fence operation is
used to force previously buffered writes to take effect globally.
Publication   RMR complexity   RMR complexity   Synchronization primitives
reference     (CC model)       (DSM model)      (+ read/write registers)
[4]           O(1)             unbounded        Fetch-and-Increment (unbounded counter)
[12]          O(1)             unbounded        Fetch-and-Store
[23]          O(1)             O(1)             Fetch-and-Store + Compare-and-Swap
[8]           O(1)             O(1)             Fetch-and-Store
[22]          O(1)             unbounded        Fetch-and-Store + Compare-and-Clear
[24]          O(1)             O(1)             Fetch-and-Store
[19]          O(1)             O(1)             Fetch-and-Store
Table 1: Properties of several constant RMR mutual exclusion algorithms.
fetch-and-φ(var, input)
1. old := var
2. var := φ(old, input)
3. return old
Figure 3: Fetch-and-φ primitive.
Many of the queue-based constant-RMR mutual exclusion algorithms cited above
were presented in the context of performance studies, and lack rigorous proofs of
correctness. Moreover, to our knowledge the only attempt to generalize or unify
these algorithms, all of which are based on the process queue concept, is the generic
algorithm of Anderson and Kim [2]. This algorithm solves mutual exclusion us-
ing O(1) RMRs per passage given a suitable shared-memory primitive fetch-and-φ,
which corresponds to the (atomic) execution of the pseudocode shown in Figure 3.
The fetch-and-φ primitive can be instantiated to a variety of shared-memory
primitives by choosing a suitable function φ. For example, a fetch-and-store corre-
sponds to
φ(old, input) ≡ input
Similarly, if we use input to encode a pair of values (a, b), a compare-and-swap
corresponds to
\[
\phi(\mathit{old}, (a, b)) \equiv
\begin{cases}
b & \text{if } \mathit{old} = a \\
\mathit{old} & \text{otherwise}
\end{cases}
\]
where a and b are the expected and target values of compare-and-swap, respectively. Thus,
fetch-and-φ generalizes various types of read-modify-write primitives, including
conditionals.
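The two instantiations above can be written out directly. The following sketch mirrors the pseudocode of Figure 3; the names Cell, phi_store and phi_cas are hypothetical, and atomicity is simply assumed because the example runs in a single thread.

```python
class Cell:
    """A shared variable holding one value."""

    def __init__(self, value):
        self.value = value


def fetch_and_phi(cell, phi, inp):
    """Apply phi to the cell and return the old value (assumed to be atomic)."""
    old = cell.value
    cell.value = phi(old, inp)
    return old


def phi_store(old, inp):
    # Fetch-and-store: the new value is the input, whatever the old value was.
    return inp


def phi_cas(old, inp):
    # Compare-and-swap: inp encodes the pair (expected, target).
    expected, target = inp
    return target if old == expected else old


x = Cell(5)
print(fetch_and_phi(x, phi_store, 9))     # 5, and x now holds 9
print(fetch_and_phi(x, phi_cas, (5, 0)))  # 9, mismatch so x still holds 9
print(fetch_and_phi(x, phi_cas, (9, 0)))  # 9, match so x now holds 0
print(x.value)                            # 0
```

Note that in this encoding a compare-and-swap reports success implicitly: the caller compares the returned old value with the expected value a.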
Unlike its predecessors, the generic fetch-and-φ algorithm of [2] uses two process
queues instead of one, in order to cope with the generic and limited assumptions on
the behaviour of the fetch-and-φ primitive. Consequently, an additional mechanism
is needed to control access to the critical section, and the algorithm loses the (FCFS)
property inherent in earlier single-queue solutions.
Correctness of the generic algorithm depends on a condition on the fetch-and-φ
primitive related to its ability to return distinct values over repeated invocations.
This condition is formalized in terms of a property of a primitive called rank. In-
tuitively, the higher the rank, the better the primitive at solving mutual exclusion
efficiently with respect to RMR complexity. A rank of 2N or greater is sufficient
for the generic algorithm, but it is not known whether rank Ω(N) is necessary
for solving mutual exclusion with O(1) RMRs per passage. Examples of primi-
tives that have rank 2N or more include an r-bounded fetch-and-increment (i.e.,
φ(old, input) = min(r − 1, old + 1)) for r ≥ 2N , which has rank r, and fetch-and-
store, which has infinite rank. Compare-and-swap as well as test-and-set can also
be modeled as fetch-and-φ primitives, but both have rank only two.
Any mutual exclusion algorithm that uses only compare-and-swap and registers
requires Ω(log N) RMRs [5, 11]. In contrast, there are mutual exclusion algorithms
that use only fetch-and-store and registers that require only O(1) RMRs (e.g., [8]).
So, from the point of view of supporting RMR-efficient implementations of mutual
exclusion, fetch-and-store is more powerful than compare-and-swap. It is interesting
that the opposite is the case from the point of view of supporting wait-free imple-
mentations of objects. It is well-known from Herlihy’s work that compare-and-swap
and registers support wait-free implementation of any object shared by any number
of processes, while there are objects shared by only three processes that cannot be
implemented wait-free using only fetch-and-store and registers [13].
3 Road Map
First, we present the model of computation in Section 4. In Section 5 we present our
generic formulation of the queue-based mutual exclusion algorithm, and prove its
correctness properties, assuming a suitable implementation of MutexQueue, a novel
queue-like data structure. Then, in Sections 6 and 7, we discuss two implementations
of MutexQueue, based on the fetch-and-increment and fetch-and-store primitives,
respectively. Our implementations closely follow the queuing code embedded in
existing queue lock algorithms [4, 8]. We conclude in Section 8 with a discussion of
the applicability of our analysis technique.
4 Model of Computation and Definitions
Our model of computation is based on [14]. A concurrent system models an asyn-
chronous shared memory system where N processes communicate by executing op-
erations on shared objects. Formally, a concurrent system is represented as a triple
S = (P,V,H), where P = {0, 1, . . . , N − 1} is a set of process identifiers, V is a set
of shared objects, also referred to as variables, and H is a set of execution histories.
Each process identifier corresponds to a process, which is a sequential thread of con-
trol that invokes operations on objects, one at a time, and receives corresponding
responses. An object represents a data structure with a well-defined set of states and
set of operations that modify the state and return responses to processes. Processes
and objects can be formally modelled as input/output automata [21], but here we
adopt a more informal approach.
Steps
Informally, we think of the behaviour of processes in a concurrent system S =
(P,V,H) as a collection of steps. There are two categories of steps – atomic and
non-atomic. In an atomic step, a process p ∈ P applies operation op on some object
v ∈ V and receives the response ret of this operation. This is denoted by a tuple
(ATOM, p, v, op, ret). We use atomic steps to denote operations on atomic objects,
such as those provided in hardware. In a non-atomic step, a process p either invokes
an operation op on some object v ∈ V, or it receives the response ret of the last
operation p invoked on v. The former is called an invocation step, and is represented
by a tuple (INV, p, v, op). The latter is called a response step, and is represented
by a tuple (RES, p, v, ret). We use non-atomic steps (along with atomic steps) to
denote operations on objects that are simulated in software from atomic objects, as
explained later.
Execution Histories
An execution history, or history for short, is a sequence of steps. An execution history
is generated as processes access objects according to the transition functions of
the corresponding automata, which we will describe using pseudocode. The histories
we will consider contain either only atomic steps, or a combination of atomic and
non-atomic steps where each object is accessed by steps of exactly one category.
We say that H is a history of (or over) object v if every step in H accesses v.
A response step eR = (RES, p, v, −) in H matches the last preceding invocation step
eI = (INV, p, v, −) in H (if one exists; see footnote 3). An invocation step is pending in H if it is
not followed by a matching response step.
An operation execution in a history H ∈ H is either a pair of matching invocation/response
steps, or a pending invocation step. We call an operation execution
complete in the former case, and pending in the latter. Two operation executions are
concurrent in H unless the response of one precedes the invocation of the other in H.
We say that H is sequential if it contains no concurrent operation executions, and
complete if it contains no pending invocations. The set H is prefix-closed, meaning
that if H ∈ H and G is a prefix of H then G ∈ H.
Footnote 3: Here and in the remainder of the paper, “−” denotes a wildcard value.
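The definitions of pending, complete and sequential histories can be illustrated on a concrete encoding. The sketch below is an assumption-laden illustration: steps are encoded as Python tuples ("INV", p, v, op) and ("RES", p, v, ret), the function names are hypothetical, and each process is assumed to have at most one open invocation per object.

```python
def pending_invocations(history):
    """Return the invocation steps that are not followed by a matching response."""
    open_inv = {}  # (process, object) -> index of the last unmatched invocation
    for i, step in enumerate(history):
        kind, p, v = step[0], step[1], step[2]
        if kind == "INV":
            open_inv[(p, v)] = i
        elif kind == "RES":
            # A response matches the last preceding invocation by p on v.
            open_inv.pop((p, v), None)
    return [history[i] for i in sorted(open_inv.values())]


def is_complete(history):
    return not pending_invocations(history)


def is_sequential(history):
    """True iff no operation is invoked while another operation is still open."""
    busy = False
    for step in history:
        if step[0] == "INV":
            if busy:
                return False  # overlaps an operation that has not yet responded
            busy = True
        elif step[0] == "RES":
            busy = False
    return True


INV0 = ("INV", 0, "T", "enqueue()")
RES0 = ("RES", 0, "T", "OK")
INV1 = ("INV", 1, "T", "enqueue()")

H1 = [INV0, INV1, RES0]  # the two operation executions overlap
H2 = [INV0, RES0, INV1]  # no overlap, but the last invocation is pending

print(pending_invocations(H1))               # [('INV', 1, 'T', 'enqueue()')]
print(is_sequential(H1), is_sequential(H2))  # False True
print(is_complete(H2))                       # False
```

The history H2 shows that the two notions are independent: a history can be sequential without being complete.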
For every history H and set P of process IDs, we denote by H|P the maximal
subsequence of H consisting only of steps by processes in P . Similarly, for every
history H and set V of objects, we denote by H|V the maximal subsequence of H
consisting only of steps on objects in V . For a single process ID p or object v, we
use H|p and H|v as shorthands for H| {p} and H| {v}, respectively. A process p is
active in a history H if H|p is not empty.
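The restriction operators amount to simple filters over the step sequence. A minimal sketch follows, assuming steps are encoded as tuples whose second and third fields are the process ID and the object; the function names, the base object "tail" and its operation string are hypothetical.

```python
def restrict_to_processes(history, procs):
    """H|P: the maximal subsequence of steps taken by processes in procs."""
    return [step for step in history if step[1] in procs]


def restrict_to_objects(history, objs):
    """H|V: the maximal subsequence of steps on objects in objs."""
    return [step for step in history if step[2] in objs]


def is_active(history, p):
    """A process is active in H if H|p is not empty."""
    return bool(restrict_to_processes(history, {p}))


H = [
    ("INV", 0, "T", "enqueue()"),
    ("ATOM", 0, "tail", "fetch_and_store(0)", None),
    ("INV", 1, "T", "enqueue()"),
    ("RES", 0, "T", "OK"),
]

print(restrict_to_processes(H, {1}))     # [('INV', 1, 'T', 'enqueue()')]
print(len(restrict_to_objects(H, {"T"})))  # 3
print(is_active(H, 2))                   # False
```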
Object Types and Conformity to a Type
Every object has a type τ = (P, S, s_init, O, R, δ) where P is a set of process IDs
(defined as for concurrent systems), S is a set of states, s_init ∈ S is the initial state,
O is a set of operations, R is the set of operation responses, and δ : P × S × O → S × R
is a (one-to-many) state transition mapping. The transition mapping δ is intended
to capture the behaviour of objects of type τ, in the absence of concurrency, as
follows: if a process p applies operation op to an object of type τ that is in state
s, then the object may return to p the response r and change its state to s′ if and
only if (s′, r) ∈ δ(p, s, op). A complete, sequential execution history H of object v
of type τ induces a sequence of tuples (p_i, op_i, r_i) such that in the i’th atomic step
or operation execution in H (depending on the structure of H), process p_i applies
operation op_i and receives response r_i. We say that v conforms to τ in H if there
exists a sequence s_0, s_1, s_2, . . . of states of τ such that s_0 = s_init and for each i ≥ 1,
(s_i, r_i) ∈ δ(p_i, s_{i−1}, op_i).
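Conformity can be checked mechanically for a finite sequence of (process, operation, response) tuples. The sketch below is illustrative: the names conforms and counter_delta are hypothetical, δ is modelled as a Python function returning a set of (next state, response) pairs, and the example type is a simple fetch-and-increment counter chosen for brevity.

```python
def conforms(delta, s_init, ops):
    """True iff some state sequence starting at s_init explains every response.

    ops is a sequence of (process, operation, response) tuples. Because delta
    is one-to-many, we track the set of states the object could be in.
    """
    states = {s_init}
    for p, op, r in ops:
        states = {
            s_next
            for s in states
            for (s_next, resp) in delta(p, s, op)
            if resp == r
        }
        if not states:
            return False  # no transition yields the observed response
    return True


def counter_delta(p, s, op):
    # Example type: a counter supporting fetch-and-increment and read.
    if op == "fai":
        return {(s + 1, s)}
    if op == "read":
        return {(s, s)}
    return set()


print(conforms(counter_delta, 0, [(0, "fai", 0), (1, "fai", 1), (0, "read", 2)]))  # True
print(conforms(counter_delta, 0, [(0, "fai", 0), (1, "fai", 0)]))                  # False
```

The second sequence fails because two consecutive fetch-and-increment operations cannot both return 0.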
Algorithms
An algorithm is a concurrent system S = (P,V,H) where every history H ∈ H
contains only atomic steps over V. We call such a history a one-level history, to
distinguish it from the more complex execution history of an implementation, defined
later. The set of histories is defined informally through a pseudo-code procedure for
each process as follows: For each operation that a process p applies to a shared
variable, H records an atomic step that encodes the variable, the operation applied,
and its response. Accesses to private variables correspond to state changes in the
automaton for a process and are not explicitly recorded in the history. Steps of
different processes can be interleaved in H arbitrarily. An infinite history H of an
algorithm is fair if every process that is active in H takes infinitely many steps. (We
do not consider terminating algorithms in this paper.)
Implementations
An implementation describes how to simulate a target object of a particular target
type using a set of base objects of specified types. Specifically, for each operation of
the target type and each process, we define an access procedure that computes the
response of the operation under consideration by performing operations on the base
objects. An implementation is a concurrent system denoted I = (P,V,H) where the
set of shared objects V consists of a distinguished target object, denoted T , and a
set of base objects. Histories in H contain a combination of atomic and non-atomic
steps. Every history H ∈ H is well-formed, meaning that the following conditions
hold:
• T is accessed only using non-atomic steps, and for every base object v ∈ V, v
is accessed only using atomic steps.
• For every base object v ∈ V, v conforms to its type in H|v.
• If eI = (INV, p, T,−) is pending in H then eI is the last non-atomic step
performed by p in H.
• If eR = (RES, p, T,−) is in H then eR matches the last invocation step of p
that precedes eR in H.
• If eA = (ATOM, p,−,−,−) is in H then it occurs after some invocation step and
before the matching response (if one exists).
We call an execution history of an implementation a two-level history since operation
executions on the target object and on base objects are nested.
Histories in H correspond to executions of the access procedures as follows. When
a process p begins executing the access procedure for operation op on T , the history
records the step (INV, p, T, op). As p subsequently executes the access procedure,
the history records corresponding atomic steps by p on base objects. Finally, when
the access procedure returns a value ret, then the history records the response step
(RES, p, T, ret). Processes may call the access procedures arbitrarily many times and
in arbitrary order. An infinite history H of an implementation is fair if every process
that is active in H either takes infinitely many steps, or applies a response step as
its last step in H. Informally, this means that in a fair history a process may not
stop executing in the middle of an access procedure.
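The structural well-formedness conditions can be checked by a single pass over a finite history. The following sketch reflects one reading of those conditions and deliberately omits the conformity condition on base objects; the function name, the tuple encoding of steps, and the base object "tail" are assumptions made for illustration.

```python
def is_well_formed(history, target="T"):
    """Check the structural conditions on a two-level history (conformity omitted)."""
    in_op = set()  # processes with an operation on the target in progress
    for step in history:
        kind, p, v = step[0], step[1], step[2]
        if kind == "ATOM":
            # Atomic steps access base objects only, and occur only between
            # an invocation and its matching response.
            if v == target or p not in in_op:
                return False
        elif v != target:
            return False  # a non-atomic step on a base object
        elif kind == "INV":
            if p in in_op:
                return False  # the previous invocation by p is still pending
            in_op.add(p)
        elif kind == "RES":
            if p not in in_op:
                return False  # a response with no invocation to match
            in_op.remove(p)
        else:
            return False  # unknown step category
    return True


good = [
    ("INV", 0, "T", "enqueue()"),
    ("ATOM", 0, "tail", "fetch_and_store(0)", None),
    ("RES", 0, "T", "OK"),
]
stray_atomic = [("ATOM", 0, "tail", "fetch_and_store(0)", None)]
double_invoke = [("INV", 0, "T", "enqueue()"), ("INV", 0, "T", "isHead()")]

print(is_well_formed(good))           # True
print(is_well_formed(stray_atomic))   # False
print(is_well_formed(double_invoke))  # False
```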
Linearizability
Linearizability [14] is widely accepted as a correctness condition for concurrent ob-
jects. Informally, it states that operation executions in a history of an implemen-
tation must appear to take effect instantaneously at some point between the cor-
responding invocation and response steps. Formally, linearizability is defined as
follows. Given a history H of an implementation, <_H is the partial order over the
set of operation executions in H defined as follows: oe1 <_H oe2 iff the response of
oe1 occurs in H before the invocation of oe2. Two execution histories G and H are
equivalent if every process executes the same sequence of steps in both histories.
Letting T denote the target object, a completion of H|T is a well-formed history H′
obtained from H by either completing (with a response event) or removing every
pending operation execution. H|T is linearizable with respect to type τ if it has
a completion equivalent to some complete sequential history H̄ over T such that
<_H ⊆ <_H̄ and where T conforms to type τ in H̄. In this case we say that H̄ is a
linearization of H. We denote the set of possible linearizations of H by Lin(H). We
say that an implementation I = (P,V,H) is linearizable with respect to type τ if for
every history H ∈ H, H|T is linearizable with respect to type τ .
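For very small histories, linearizability can be decided by brute force. The sketch below is a simplification rather than a general checker: it assumes a complete, well-formed history over the target object only (so no completion step is needed), encodes steps as 4-tuples, and uses a fetch-and-increment counter as the example type. All function names are hypothetical.

```python
from itertools import permutations


def operations(history):
    """Pair invocations with responses; assumes a complete, well-formed history."""
    open_inv, ops = {}, []
    for i, (kind, p, _obj, val) in enumerate(history):
        if kind == "INV":
            open_inv[p] = (i, val)
        else:
            inv_index, op = open_inv.pop(p)
            ops.append((p, op, val, inv_index, i))
    return ops


def linearizations(history, delta, s_init):
    """Return every sequential order that respects <_H and conforms to the type."""
    found = []
    for order in permutations(operations(history)):
        # Reject the order if some later operation b responded in H before an
        # earlier operation a was invoked (that is, b <_H a).
        if any(b[4] < a[3] for k, a in enumerate(order) for b in order[k + 1:]):
            continue
        states = {s_init}
        for p, op, r, _, _ in order:
            states = {s2 for s in states for (s2, r2) in delta(p, s, op) if r2 == r}
        if states:
            found.append([(p, op, r) for p, op, r, _, _ in order])
    return found


def counter_delta(p, s, op):
    # Example type: fetch-and-increment returns the old value of the counter.
    return {(s + 1, s)} if op == "fai" else set()


overlapping = [("INV", 0, "T", "fai"), ("INV", 1, "T", "fai"),
               ("RES", 0, "T", 1), ("RES", 1, "T", 0)]
ordered = [("INV", 0, "T", "fai"), ("RES", 0, "T", 1),
           ("INV", 1, "T", "fai"), ("RES", 1, "T", 0)]

print(linearizations(overlapping, counter_delta, 0))  # [[(1, 'fai', 0), (0, 'fai', 1)]]
print(linearizations(ordered, counter_delta, 0))      # []
```

The two histories contain the same operations and responses. Only the first is linearizable, because in the second the operation that returned 1 completed before the operation that returned 0 was invoked.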
Additional Notation
Let G,H be execution histories. If s is a step, we denote by s ∈ H that step s occurs
in H, by proc(s) the process that executes s, and by var(s) the object on which s
operates. We denote by G ⪯ H that G is a prefix of H, and by G ≺ H that G is a
proper prefix of H. If v is an object and H is an execution history such that H|v is
complete and sequential, then we denote the state of v at the end of H|v by vH .
Given execution histories (or, more generally, sequences) H and G, let G ◦ H
denote the concatenation of G and H (i.e., elements of H appended to G). If G
is finite, |G| denotes the length of G. For 0 ≤ i < |G|, G[i] denotes the i’th step
(counting from 0) of G. G[i..j] denotes the subsequence of G consisting of all G[k]
such that i ≤ k ≤ j.
5 Generic Queue-Based Algorithm
5.1 The MutexQueue Type
An N-process MutexQueue is a queue-like object type that stores a subset of N
process IDs (subsequently also referred to as processes). The state of MutexQueue
is an ordered pair (Q,V ), where Q and V are a sequence and a set, respectively, of
elements from P. Informally, Q represents the sequence of processes waiting to enter
the critical section, and V is a subset of these processes that are visible. Intuitively,
a process becomes visible when it “makes itself known” to its predecessor in the
queue. The initial state is (〈〉 , ∅). In addition, we define a special broken state
⊥, indicating that a process has violated the etiquette for accessing MutexQueue
(explained below).
A MutexQueue supports three types of operations: enqueue(), isHead(), and
dequeue(). Informally, enqueue() adds the executing process to the end of the
queue and always returns the response OK; isHead() makes the executing process
visible and returns true if and only if this process is the head of the queue; and
dequeue() removes the executing process from the head of the queue, and returns
the ID of the successor process in the queue, if it exists and is visible (or −1 oth-
erwise). As mentioned earlier, processes are expected to follow a certain etiquette
in accessing MutexQueue. Specifically, a process must not invoke enqueue() if it is
already in the queue, isHead() if it is not in the queue or is already visible, and
dequeue() if it is not the head of the queue or is not visible. Failure to comply with
this etiquette causes the MutexQueue to enter the broken state ⊥, and thereafter
all responses are completely arbitrary. These restrictions on accessing MutexQueue
make it easier to implement this object. As we will see, they are observed by our
generic algorithm that uses MutexQueue to solve mutual exclusion (see Section 5.2).
Prima facie, it would seem that we can have a simpler definition of MutexQueue,
and a correspondingly simpler version of the generic mutual exclusion algorithm
based on MutexQueue, by combining the enqueue() and isHead() operations into
a single operation that adds the ID of the executing process to the end of the queue,
and returns true if that process is the head of the queue and false otherwise.
Unfortunately, the resulting operation seems too strong; we were not able to find
an implementation for it that uses standard synchronization primitives and incurs
only a constant number of RMRs. By splitting the functionality into two separate
operations, such implementations become feasible.
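Formally, an N-process MutexQueue is specified by the tuple (P, O, R, S, τ)
where
\begin{align*}
P &= \{0, 1, \ldots, N-1\} \\
O &= \{\texttt{enqueue()}, \texttt{isHead()}, \texttt{dequeue()}\} \\
R &= \{\text{true}, \text{false}, -1\} \cup \{0, 1, \ldots, N-1\} \\
S &= \{\bot\} \cup \{(Q, V) \mid Q \text{ is a permutation of a subsequence of } \langle 0, 1, \ldots, N-1 \rangle \text{ and } p \in V \text{ only if } p \in Q\}
\end{align*}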
and the state transition mapping τ is defined as follows:
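\begin{align*}
\tau(p, s, \texttt{enqueue()}) &=
\begin{cases}
\{((Q \circ \langle p \rangle, V), \text{OK})\} & \text{if } s = (Q, V) \text{ and } p \notin Q \\
\{(\bot, ret) \mid ret \in R\} & \text{otherwise}
\end{cases} \\[1ex]
\tau(p, s, \texttt{isHead()}) &=
\begin{cases}
\{((Q, V \cup \{p\}), \text{true})\} & \text{if } s = (Q, V),\ p \in Q,\ p \notin V \text{ and } Q[0] = p \\
\{((Q, V \cup \{p\}), \text{false})\} & \text{if } s = (Q, V),\ p \in Q,\ p \notin V \text{ and } Q[0] \neq p \\
\{(\bot, ret) \mid ret \in R\} & \text{otherwise}
\end{cases} \\[1ex]
\tau(p, s, \texttt{dequeue()}) &=
\begin{cases}
\{((Q[1..|Q|-1], V \setminus \{p\}), Q[1])\} & \text{if } s = (Q, V),\ p \in Q,\ p \in V,\ Q[0] = p,\ |Q| > 1 \text{ and } Q[1] \in V \\
\{((Q[1..|Q|-1], V \setminus \{p\}), -1)\} & \text{if } s = (Q, V),\ p \in Q,\ p \in V,\ Q[0] = p,\ (|Q| = 1 \text{ or } Q[1] \notin V) \\
\{(\bot, ret) \mid ret \in R\} & \text{otherwise}
\end{cases}
\end{align*}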
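The transition mapping can be read as a small sequential state machine. The following Python sketch is one possible rendering under stated assumptions: the class name MutexQueueSpec and the constant BROKEN are hypothetical, and since the specification allows an arbitrary response once the broken state ⊥ is reached, the sketch arbitrarily returns −1 in that case.

```python
BROKEN = None  # stands for the broken state ⊥


class MutexQueueSpec:
    """Sequential model of the MutexQueue type (an illustrative sketch)."""

    def __init__(self):
        self.state = ([], set())  # (Q, V), initially the empty sequence and empty set

    def _break(self):
        # Etiquette violated: enter ⊥. Any response is permitted; we pick -1.
        self.state = BROKEN
        return -1

    def enqueue(self, p):
        if self.state is BROKEN or p in self.state[0]:
            return self._break()
        q, v = self.state
        self.state = (q + [p], v)          # append p to Q, V unchanged
        return "OK"

    def is_head(self, p):
        if self.state is BROKEN:
            return self._break()
        q, v = self.state
        if p not in q or p in v:
            return self._break()
        self.state = (q, v | {p})          # p becomes visible
        return q[0] == p

    def dequeue(self, p):
        if self.state is BROKEN:
            return self._break()
        q, v = self.state
        if not q or q[0] != p or p not in v:
            return self._break()
        rest = q[1:]
        self.state = (rest, v - {p})       # remove p from Q and from V
        # Return the successor only if it exists and is visible.
        return rest[0] if rest and rest[0] in v else -1
```

For example, if processes 0 and 1 enqueue in that order and only process 0 calls isHead(), then dequeue() by process 0 returns −1 because process 1 is not yet visible; process 1 then learns that it is the head from its own isHead() call.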
Observation 5.1. Let H be an execution history over an atomic N-process MutexQueue
object M such that M^H = (Q, V) ≠ ⊥. Then the following hold:
(a) for every process p, if p ∈ V then p ∈ Q
(b) for every process p, Q contains at most one instance of p
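Given a state s = (Q, V) of a MutexQueue object, s ≠ ⊥, we define the following
predicates and functions.
\begin{align*}
\mathit{QProcs}(s) &:= \{p \mid p \in Q\} \\
\mathit{VisProcs}(s) &:= V \\
\mathit{empty}(s) &:=
\begin{cases}
\text{true} & \text{if } Q = \langle\rangle \\
\text{false} & \text{otherwise}
\end{cases} \\
\mathit{head}(s) &:=
\begin{cases}
Q[0] & \text{if } |Q| \geq 1 \\
\bot & \text{otherwise}
\end{cases} \\
\mathit{pred}(s, p) &:=
\begin{cases}
q & \text{if } \langle q, p \rangle \text{ is a subsequence of } Q \\
\bot & \text{if } Q \text{ has no such subsequence}
\end{cases} \\
\mathit{succ}(s, p) &:=
\begin{cases}
q & \text{if } \langle p, q \rangle \text{ is a subsequence of } Q \\
\bot & \text{if } Q \text{ has no such subsequence}
\end{cases}
\end{align*}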
Note that for every p ∈ P, the values pred(s, p) and succ(s, p) are uniquely defined
by Observation 5.1 (b). If p, q ∈ P and s is a MutexQueue state then we use the
phrases “s is empty,” “p is in the queue,” “p is the head of s,” “p is visible in s,”
“p is the successor of q in s” and “p is the predecessor of q in s,” to denote the
conditions empty(s), p ∈ QProcs(s), p = head(s), p ∈ VisProcs(s), p = succ(s, q),
and p = pred(s, q), respectively.
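These helper functions are easy to express over a state represented as a pair (Q, V). The sketch below is illustrative only and makes one assumption explicit: "subsequence" in the definitions of pred and succ is read as a pair of adjacent elements of Q, which is the reading under which the values are unique. BOTTOM is a hypothetical stand-in for ⊥.

```python
BOTTOM = None  # stands for ⊥


def head(state):
    """First element of Q, or ⊥ if Q is empty."""
    q, _ = state
    return q[0] if q else BOTTOM


def pred(state, p):
    """Element immediately before p in Q, or ⊥ if there is none."""
    q, _ = state
    if p in q and q.index(p) > 0:
        return q[q.index(p) - 1]
    return BOTTOM


def succ(state, p):
    """Element immediately after p in Q, or ⊥ if there is none."""
    q, _ = state
    if p in q and q.index(p) + 1 < len(q):
        return q[q.index(p) + 1]
    return BOTTOM
```

Because process ID 0 is a legitimate value, callers should compare results against BOTTOM with `is` rather than relying on truthiness.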
Shared variables:
    Wait: array [0..N − 1] of Boolean, initially all true
          (Wait[p] local to p on a DSM machine)
    M: N-process MutexQueue
Private per-process variables:
    nextHead: integer −1..N − 1
Algorithm for process p:
       loop
    1      NCS
    2      M.enqueue()
    3      if ¬M.isHead() then
    4          while Wait[p].read = true do
               end
    5          Wait[p].write(true)
           end
    6      CS
    7      nextHead := M.dequeue()
    8      if nextHead ≠ −1 then
    9          Wait[nextHead].write(false)
           end
       forever
Figure 4: Algorithm GQME (Generic Queue-based Mutual Exclusion) for N processes.
5.2 Generic Mutual Exclusion Algorithm
In this section we analyze the mutual exclusion algorithm shown in Figure 4. In
addition to an atomic MutexQueue object, the algorithm uses an array Wait[0..N − 1]
of Boolean read/write registers.
Informally, the algorithm uses the MutexQueue object M to maintain a queue of
processes that are competing to enter the critical section. The enqueue() operation
at line 2 constitutes the doorway, and the remaining statements leading up to the
CS comprise the waiting room. In the exit protocol, spanning lines 7 to 9, a process
signals its successor in M (if present and visible) to exit the waiting room and
proceed to the CS.
We use syntax of the form V.op(args) in Figure 4 to indicate that process p
invokes operation op(args) on the shared variable V . Operations on shared registers
are denoted read and write.
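The control flow of Algorithm GQME can be traced with a small threaded simulation. The sketch below is not the report's implementation: AtomicMutexQueue is a hypothetical stand-in that uses a Python lock purely to make each MutexQueue operation atomic (the report instead builds M from hardware primitives), and run_gqme, in_cs, entries and violations are names introduced here for illustration. The comments refer to the line numbers of Figure 4.

```python
import threading
import time


class AtomicMutexQueue:
    """Lock-protected stand-in for the atomic object M.
    Assumes callers follow the access etiquette, as GQME does."""

    def __init__(self):
        self._lock = threading.Lock()  # models atomicity of M only
        self._q = []
        self._visible = set()

    def enqueue(self, p):
        with self._lock:
            self._q.append(p)
        return "OK"

    def is_head(self, p):
        with self._lock:
            self._visible.add(p)
            return self._q[0] == p

    def dequeue(self, p):
        with self._lock:
            self._q.pop(0)
            self._visible.discard(p)
            if self._q and self._q[0] in self._visible:
                return self._q[0]
            return -1


def run_gqme(n_procs=3, passages=50):
    m = AtomicMutexQueue()
    wait = [True] * n_procs   # Wait[0..N-1], initially all true
    in_cs = 0                 # number of processes currently in the CS
    entries = 0               # total CS entries observed
    violations = 0            # times more than one process was in the CS

    def process(p):
        nonlocal in_cs, entries, violations
        for _ in range(passages):
            m.enqueue(p)                   # line 2 (doorway)
            if not m.is_head(p):           # line 3
                while wait[p]:             # line 4 (busy-wait)
                    time.sleep(0)          # yield so other threads can run
                wait[p] = True             # line 5
            in_cs += 1                     # line 6: critical section
            if in_cs != 1:
                violations += 1
            entries += 1
            in_cs -= 1
            next_head = m.dequeue(p)       # line 7
            if next_head != -1:            # line 8
                wait[next_head] = False    # line 9

    threads = [threading.Thread(target=process, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return entries, violations
```

The counters are updated without any extra locking on purpose: they are only touched inside the critical section, so a lost update or a nonzero violation count would indicate that mutual exclusion failed in the simulation.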
5.2.1 Correctness Properties
Mutual Exclusion (ME): at most one process is in the CS at any time.
First-Come First-Served (FCFS): processes enter the CS in the order in which they
are enqueued at line 2.
Lockout Freedom (LF): if a process leaves the NCS then it eventually enters the CS.
Bounded Exit (BE): if a process leaves the CS then it enters the NCS within a
bounded number of its own steps.
5.2.2 Proof of Correctness
Let S = (P,V,H) be the concurrent system corresponding to Algorithm GQME
where P = {0, 1, . . . , N − 1}, V = {M, Wait[0], . . . , Wait[N − 1]}, and H is the set
of execution histories of Algorithm GQME. Each (concurrent) execution of
Algorithm GQME is represented by a one-level history H ∈ H as follows. For each
operation that a process p applies to a shared variable (e.g., M.enqueue() at line 2), H
records an atomic step (see footnote 4). The sequence of steps of each process in H is
determined by the pseudocode shown in Figure 4. For example, if process p applies operation
isHead() to M (see line 3) with response false, then the next step of p in H (if one
exists) applies read to Wait[p]; otherwise, the next step of p in H (if one exists)
applies dequeue() to M. The steps of different processes can be interleaved in H
in any way provided that each variable in V conforms to its type in H.
For any history H ∈ H, any process p ∈ P, and any integer i ∈ Z+, we say that p
is in the CS in passage i at the end of H if and only if p performs its last step in H
during its i’th passage through Algorithm GQME, and furthermore this step is either
〈(ATOM, p, M, isHead(), true)〉 (see line 3) or 〈(ATOM, p, Wait[p], write(true), OK)〉
(see line 5). Similarly, we say that p has completed the CS in passage i at the end of
H if and only if H contains a step (ATOM, p, M, dequeue(), −) (see line 7) performed
by p during passage i through Algorithm GQME.
For ease of exposition, we distinguish a number of phases in which a process
may be at the end of a history H ∈ H. The phases are defined in Table 2 and
the transitions between them are illustrated in Figure 5. Each phase is bounded by
steps on shared objects.
Note that the first five phases defined in Table 2 are mutually exclusive, whereas
EXIT is a sub-phase of NEAR NCS, and is not necessarily traversed by a process in
every passage through Algorithm GQME. We will subsequently use the name of a
phase as a predicate indicating that a process is in the given phase, e.g., WAIT(p)^H =
true iff process p is in the WAIT phase at the end of a history H ∈ H.
Footnote 4: Note that H does not record steps corresponding to the private variable
nextHead. The value of nextHead is part of the local state of a process.
Phase name   From operation         To operation           Notes/conditions
DOORWAY      enqueue() at line 2    isHead() at line 3
WAIT         isHead() at line 3     write at line 5        no branch at line 3
DONE WAIT    write at line 5        dequeue() at line 7
NO WAIT      isHead() at line 3     dequeue() at line 7    branch from line 3 to line 6
NEAR NCS     dequeue() at line 7    enqueue() at line 2    via the NCS
EXIT         dequeue() at line 7    write at line 9        no branch at line 8
Table 2: Process phase definitions.
[Diagram not reproduced. Transitions shown: NEAR NCS to DOORWAY on M.enqueue(); DOORWAY to NO WAIT when M.isHead() returns true; DOORWAY to WAIT when M.isHead() returns false; WAIT to DONE WAIT on Wait[p] ← true; NO WAIT and DONE WAIT to NEAR NCS on M.dequeue().]
Figure 5: Phase transitions of process p executing Algorithm GQME.
To prove the correctness of the algorithm we will establish the following invariant.
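Invariant 5.2. Let H ∈ H. Define lastPred(H, p) as the last process enqueued
before p’s last enqueue() operation in H, or ⊥ if no such process exists. Then
M^H ≠ ⊥ and for every p ∈ P, the following statements hold, collectively denoted
Invariant 5.2–(H, p):
(a) if p = head(M^H) then
\begin{align*}
& \text{NEAR NCS}(p)^H = \text{false} \\
& \text{DOORWAY}(p)^H \implies \mathit{Wait}[p]^H = \text{true} \\
& \text{WAIT}(p)^H \implies \mathit{Wait}[p]^H =
\begin{cases}
\text{true} & \text{if } \mathit{lastPred}(H, p) \neq \bot \wedge \text{EXIT}(\mathit{lastPred}(H, p))^H \\
\text{false} & \text{otherwise}
\end{cases} \\
& \text{DONE WAIT}(p)^H \implies \mathit{Wait}[p]^H = \text{true} \\
& \text{NO WAIT}(p)^H \implies \mathit{Wait}[p]^H = \text{true}
\end{align*}
(b) if p ∈ QProcs(M^H) ∧ p ≠ head(M^H) then
\begin{align*}
& \text{DOORWAY}(p)^H =
\begin{cases}
\text{true} & \text{if } p \notin \mathit{VisProcs}(M^H) \\
\text{false} & \text{otherwise}
\end{cases} \\
& \text{WAIT}(p)^H =
\begin{cases}
\text{true} & \text{if } p \in \mathit{VisProcs}(M^H) \\
\text{false} & \text{otherwise}
\end{cases} \\
& \mathit{Wait}[p]^H = \text{true}
\end{align*}
(c) if p ∉ QProcs(M^H) then
\begin{align*}
& \text{NEAR NCS}(p)^H = \text{true} \\
& \mathit{Wait}[p]^H = \text{true}
\end{align*}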
Theorem 5.3. For any H ∈ H, Invariant 5.2 holds for H.
Proof. We proceed by induction on |H|.
Basis: |H| = 0. In this case, M^H is the initial state (〈〉, ∅), hence M^H ≠ ⊥.
Since empty(M^H) is true, for every p ∈ P parts (a) and (b) of Invariant 5.2–(H, p)
hold trivially (since their antecedents are false), and part (c) holds because
NEAR NCS(p)^H and Wait[p]^H = true by initialization.
Induction Hypothesis: For any k > 0, assume Theorem 5.3 holds for all histories
H ∈ H such that |H| < k.
Induction Step: We must prove Theorem 5.3 for all H such that |H| = k. Let
σ be the last step in H and let G satisfy H = G ◦ σ. By the IH, M^G ≠ ⊥ and
Invariant 5.2–(G, p) holds for all p. Define a critical operation as a write operation
to an element of Wait or any operation on M (i.e., an operation causing a process to
change phases). If σ is not critical, the fact that Theorem 5.3 holds for G immediately
implies that it also holds for H. Consequently, it suffices to prove that Theorem 5.3
holds for H if σ is a critical step. We proceed by cases on σ.
Case A: step σ is an M.enqueue() by p (see line 2). In this case, p goes from
NEAR NCS to DOORWAY. Since M^G ≠ ⊥ by the IH and NEAR NCS(p)^G,
Invariant 5.2–(G, p) implies p ∉ QProcs(M^G), and M^H ≠ ⊥ holds by the state transition
relation of MutexQueue. Next, note that for every q ∈ P \ {p}, Invariant 5.2–(G, q)
implies Invariant 5.2–(H, q). It remains to show Invariant 5.2–(H, p).
Subcase A1: p = head(M^H). Since NEAR NCS(p)^G, Invariant 5.2–(G, p) implies
that Wait[p]^G = true. Thus, Wait[p]^H = true, and part (a) of Invariant 5.2–(H, p)
holds. Parts (b) and (c) follow trivially since p = head(M^H).
Subcase A2: p ≠ head(M^H). We have Wait[p]^G = true as in subcase A1. Since
p ∉ VisProcs(M^H) by σ, part (b) of Invariant 5.2–(H, p) holds. Parts (a) and (c)
follow trivially since p ∈ QProcs(M^H) and p ≠ head(M^H).
Case B: step σ is an M.isHead() by p, with response ret (see line 3). In this
case, p goes from DOORWAY to WAIT or NO WAIT. Since DOORWAY(p)^G,
Invariant 5.2–(G, p) implies p ∈ QProcs(M^G). Moreover, since M^G ≠ ⊥ by the IH
and G|M|p ends with an enqueue() step, it follows that p ∉ VisProcs(M^G) and
M^H ≠ ⊥. Furthermore, ret is either true or false by the specification of
MutexQueue. As in Case A, Invariant 5.2–(H, q) holds for every q ∈ P \ {p} and it
remains to show Invariant 5.2–(H, p).
Subcase B1: the last step in H (an M.isHead()) returns true. Then p =
head(M^H) by the specification of MutexQueue (since M^G ≠ ⊥) and NO WAIT(p)^H
is true by the algorithm. From Invariant 5.2–(G, p) part (a) we have that Wait[p]^G =
true, hence Wait[p]^H = true. Thus, part (a) of Invariant 5.2–(H, p) holds. Parts
(b) and (c) hold trivially since p = head(M^H).
Subcase B2: the last step in H (an M.isHead()) returns false. Then p ≠
head(M^H) and p ∈ VisProcs(M^H) by the specification of MutexQueue (since
M^G ≠ ⊥) and WAIT(p)^H is true by the algorithm. We have Wait[p]^H = true
as in subcase B1; since p ∈ VisProcs(M^H), part (b) of Invariant 5.2–(H, p) holds.
Parts (a) and (c) hold trivially since p ≠ head(M^H) and p ∈ VisProcs(M^H).
Case C: step σ is a write of true by p to Wait[p] (see line 5). In this case, p
goes from WAIT to DONE WAIT. It follows that M^H ≠ ⊥ since M^H = M^G and
M^G ≠ ⊥ by the IH. As in the previous case, for all q ∈ P \ {p}, Invariant 5.2–(G, q)
immediately implies that Invariant 5.2–(H, q) holds, and so it remains to show that
Invariant 5.2–(H, p) holds. Since WAIT(p)^G, part (c) of Invariant 5.2–(G, p) implies
that p ∈ QProcs(M^G). Moreover, Wait[p]^G = false by the algorithm since p’s last
read of Wait[p] in G returns false and the value of Wait[p] is not changed until σ
occurs. Since Wait[p]^G = false, Invariant 5.2–(G, p) implies p = head(M^G). Thus,
p = head(M^H) holds; furthermore, Wait[p]^H = true by the effect of the step σ.
This implies part (a) of Invariant 5.2–(H, p). Parts (b) and (c) hold trivially since
p = head(M^H).
Case D: step σ is an M.dequeue() by p, with response ret (see line 7). In this case,
p goes from NO WAIT or DONE WAIT to NEAR NCS. Since either NO WAIT(p)^G
or DONE WAIT(p)^G, Invariant 5.2–(G, p) implies that p = head(M^G). Moreover,
since G|M|p ends with an isHead() step, and since M^G ≠ ⊥ by the IH, ret is either
−1 or succ(M^G, p) by the specification of MutexQueue, and p ∈ VisProcs(M^G).
Thus, M^H ≠ ⊥ is true. Now, let s = succ(M^G, p), and note that either s = ⊥
and empty(M^H), or s = head(M^H). It follows that for every q ∈ P \ {p, s},
Invariant 5.2–(G, q) implies Invariant 5.2–(H, q). Next, consider Invariant 5.2–(H, p).
To that end, we have p ∉ QProcs(M^H) by the specification of MutexQueue, and
Wait[p]^G = true by part (a) of Invariant 5.2–(G, p) (since p = head(M^G), as argued
above). Consequently, part (c) of Invariant 5.2–(H, p) holds, and parts (a) and (b)
follow trivially. Finally, we must show Invariant 5.2–(H, s) supposing that s ≠ ⊥.
Observe that s ≠ ⊥ implies ¬empty(M^H) and s = head(M^H) by σ. Moreover,
p ≠ s by Observation 5.1 (b) applied to G, so s ≠ head(M^G).
Subcase D1: step σ (an M.dequeue()) returns −1. It follows that s ∉ VisProcs(M^H).
Consequently, by part (b) of Invariant 5.2–(G, s) we have DOORWAY(s)^G and Wait[s]^G =
true. Thus, DOORWAY(s)^H and Wait[s]^H = true hold, which implies part (a) of
Invariant 5.2–(H, s). In addition, parts (b) and (c) hold trivially.
Subcase D2: step σ (an M.dequeue()) returns a process ID ret. It follows that
ret = s and s ∈ VisProcs(M^H). Consequently, by part (b) of Invariant 5.2–(G, s)
we have WAIT(s)^G and Wait[s]^G = true. Thus, WAIT(s)^H and Wait[s]^H = true.
Observe that EXIT(p)^H holds by the algorithm, so part (a) of Invariant 5.2–(H, s)
holds. In addition, parts (b) and (c) hold trivially.
Case E: step σ is a write of false by p to Wait[i] for some i (see line 9). In this
case, p leaves EXIT (which is part of NEAR NCS) and remains in NEAR NCS. As
in Case C, it follows that M^H ≠ ⊥. Let D and E be prefixes of G such that E =
D ◦ 〈(ATOM, p, M, dequeue(), i)〉 and |D| is maximal. Since M^G ≠ ⊥ it follows that
M^D ≠ ⊥, M^E ≠ ⊥, p ∈ QProcs(M^D), and p = head(M^D). Let s = succ(M^D, p),
and observe that, as in Case D, p ≠ s, hence s ≠ head(M^D). Also note that i ≠ −1
since p has branched to line 9, hence i = s, s ∈ VisProcs(M^D), s ∈ VisProcs(M^E),
and s = head(M^E). Since s ≠ head(M^D) and s ∈ VisProcs(M^D), part (b) of
Invariant 5.2–(D, s) implies WAIT(s)^D and Wait[s]^D = true. Next, note that
p = lastPred(E, s) and that p performs no critical steps in G after E. Moreover,
for every history F such that E ⪯ F ⪯ G, EXIT(p)^F holds and so a straightforward
induction on |F| shows (using part (a) of Invariant 5.2–(F, s)) that s = head(M^F),
Wait[s]^F = true, WAIT(s)^F, and p = lastPred(F, s). Thus, s = head(M^H),
WAIT(s)^H, and p = lastPred(H, s) all hold, and Wait[s]^H = false by the effect of
step σ. Since ¬EXIT(p)^H, part (a) of Invariant 5.2–(H, s) is satisfied. In addition,
parts (b) and (c) hold trivially. Finally, for every q ∈ P \ {s}, note that
Invariant 5.2–(H, q) follows immediately from Invariant 5.2–(G, q).
Lemma 5.4. Let H ∈ H and suppose that in H process p executes M.enqueue()
in passage i before q executes M.enqueue() in passage j, and that at the end of H,
q is in the CS in passage j. Then p has executed M.dequeue() in passage i in H.
Proof. Let H, p and q be as in the hypothesis of the lemma and suppose for
contradiction that p has not executed M.dequeue() in passage i in H. From Theorem 5.3
and Invariant 5.2–(H, q), it follows that M^H ≠ ⊥ and q = head(M^H). Since p was
enqueued in passage i before q in passage j, this implies that p in passage i has
been dequeued in H. Since M^H ≠ ⊥, it follows that p has executed M.dequeue()
in passage i in H, which contradicts the original hypothesis.
Corollary 5.5. Algorithm GQME satisfies Mutual Exclusion.
Proof. Suppose for contradiction that there exists an execution history H ∈ H at
the end of which distinct processes p and q are both in the CS, in passages i and
j, respectively. Let H̄ ∈ Lin(H). Without loss of generality, suppose that in H̄,
p executes M.enqueue() in passage i before q executes M.enqueue() in passage
j. Note that at the end of H̄, p and q are both in the CS, in passages i and j,
respectively, and in particular p has not executed M.dequeue() in passage i (since
p has not invoked M.dequeue() in passage i in H). Thus, H̄, p, and q contradict
Lemma 5.4.
Corollary 5.6. Algorithm GQME satisfies First-Come First-Served.
Proof. Suppose for contradiction that there exists an execution history H ∈ H
in which process p completes its execution of M.enqueue() in passage i before q
begins its execution of M.enqueue() in passage j, and at the end of which q is in
the CS in passage j but p has not completed the CS in passage i. In particular,
p has not invoked M.dequeue() in passage i in H. Let H̄ ∈ Lin(H). Then p
executes M.enqueue() in passage i before q executes M.enqueue() in passage j in
H̄. Furthermore, at the end of H̄, q is in the CS in passage j but p has not executed
M.dequeue() in passage i (since p has not invoked M.dequeue() in passage i in
H). Thus, H̄, p, and q contradict Lemma 5.4.
Theorem 5.7. Algorithm GQME satisfies Lockout Freedom.
Proof. Suppose for contradiction that there is an infinite fair history H ∈ H in
which some process p begins some passage i and then takes infinitely many steps
but never completes passage i. By the structure of the algorithm, p in passage i
loops forever at line 4, repeatedly reading Wait[p] = true. Let E be a prefix of
H up to but not including the last step (ATOM, p, M, enqueue(), OK) (see line 2).
Choose p so that |E| is minimal. Let F be a prefix of H up to and including the
last step 〈(ATOM, p, M, isHead(), ret)〉 for some response ret. Since p loops forever
at line 4 it follows that F exists, E ⪯ F, and ret = false. Furthermore, M^F ≠ ⊥
by Theorem 5.3, and p ≠ head(M^F) since ret = false, so pred(M^F, p) ≠ ⊥. Let
q = pred(M^F, p) and note that since |E| is minimal and since H is fair, q eventually
enters phase NEAR NCS in H after F. In particular, q eventually executes
(ATOM, q, M, dequeue(), p), (ATOM, q, Wait[p], write(false), OK), in that order,
corresponding to line 7 and line 9. Now let G be any prefix of H such that F ⪯ G ⪯ H,
in which q has executed the above two steps. It follows that Wait[p]^G = false,
which contradicts p repeatedly reading Wait[p] = true at line 4 in H after the
prefix F. (We do not consider the possibility of p looping forever during a dequeue()
operation on M because we assume in this section that M is an atomic base object.
Later on we will show for each implementation of MutexQueue that each operation
on the implemented object incurs O(1) steps.)
Theorem 5.8. Algorithm GQME satisfies the bounded exit property.
Proof. The result follows directly from the structure of Algorithm GQME. (We do
not consider the number of steps incurred during a dequeue() operation on M
because we assume in this section that M is an atomic base object. Later on we
will show for each implementation of MutexQueue that a call to dequeue() incurs
O(1) steps.)
Theorem 5.9. Algorithm GQME has RMR complexity O(1) per passage in both
the CC and DSM models provided that each operation on M incurs O(1) RMRs.
Proof. Note that each passage involves only three MutexQueue operations, at most
two atomic write operations, and an unbounded number of atomic read operations
at line 4. So, it suffices to show that a process performs O(1) remote memory
references at line 4. This is obvious in the DSM model since Wait[p] is local to p,
in which case a process incurs zero RMRs on line 4. Now consider the CC model.
Note that p incurs at most one RMR at line 4 before Wait[p] = true is local to p
(if this ever occurs). Also, p is the only process that can assign Wait[p] = true, so
a subsequent cache miss implies that p reads Wait[p] = false. Thus, p breaks out
of the busy-wait loop at line 4 after at most two RMRs in total.
6 Wait-free Implementation of MutexQueue Using Fetch-and-Increment
The implementation of an N-process MutexQueue object described in Figure 6 is
based on the mutual exclusion algorithm of T. Anderson [4], as modified by J.
Anderson and Y.-J. Kim for efficient operation in the DSM model (see footnote 7 in
[2]). It relies on a shared object supporting a fetch-and-increment (F&I()) operation,
which atomically increments a variable and returns its previous value. We assume
that this shared object can also be reset to an initial value, e.g., via a write.
Implementation MQFI (Figure 6) explicitly maintains a queue of processes using
a pair of circular arrays. When a process enqueues itself, it obtains an index in
the two arrays by atomically incrementing variable Ctr at line 1. Thus, the set
of processes enqueued at a given time maps to a contiguous (modulo N) block
of array indices. The array Proc stores the IDs of enqueued processes (that are
visible), and array Stat tracks the index of the head element and the visibility of
each process. Roughly speaking, this is done as follows: when a process p enqueues
itself after a predecessor q, it is assigned array index i, where Stat[i] = 0. This
value of Stat[i] indicates that p is neither visible nor the head of the MutexQueue.
Stat[i] later becomes 1 if either p becomes visible or q dequeues itself, making p
the head element. Stat[i] becomes 2 once both p has become visible and q has
dequeued itself. Finally, Stat[i] is reset back to 0 when p dequeues itself. Elements
of Stat are updated atomically using fetch-and-increment to ensure that processes
performing concurrent isHead() and dequeue() operations receive consistent views
of the MutexQueue object (recall the discussion of race conditions in the second
paragraph of Section 1.2).
Shared variables:
    Stat: array [0..N − 1] of integer 0..2,
          initially Stat[i] = 1 if i = 0, and 0 otherwise
    Proc: array [0..N − 1] of integer 0..N − 1, uninitialized
    Ctr: integer, initially zero
Static private (per-process) variables:
    index: integer 0..N − 1, uninitialized
Procedure for operation enqueue() by process p:
    1  index := Ctr.F&I() mod N
    2  return OK
Procedure for operation isHead() by process p:
    3  Proc[index].write(p)
    4  return Stat[index].F&I() = 1
Procedure for operation dequeue() by process p:
    5  Stat[index].write(0)
    6  if Stat[(index + 1) mod N].F&I() = 1 then
    7      return Proc[(index + 1) mod N].read()
       else
    8      return −1
       end
Figure 6: Implementation MQFI (N-process MutexQueue implementation using
Fetch-and-Increment).
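The procedures of Figure 6 translate almost line for line into executable form. The Python sketch below is illustrative rather than authoritative: the class names FetchAndIncrement and MQFI are hypothetical, a lock simulates the atomicity that a hardware fetch-and-increment would provide, and the per-process private variable index is stored in a list indexed by process ID. Comments give the corresponding line numbers of Figure 6.

```python
import threading


class FetchAndIncrement:
    """Register supporting read, write and fetch-and-increment.
    The lock only simulates hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_increment(self):
        with self._lock:
            old = self._value
            self._value = old + 1
            return old

    def write(self, value):
        with self._lock:
            self._value = value

    def read(self):
        with self._lock:
            return self._value


class MQFI:
    def __init__(self, n):
        self.n = n
        # Stat[0] starts at 1, all other entries at 0.
        self.stat = [FetchAndIncrement(1 if i == 0 else 0) for i in range(n)]
        self.proc = [None] * n        # uninitialized in Figure 6
        self.ctr = FetchAndIncrement(0)
        self.index = [None] * n       # private variable `index` of each process

    def enqueue(self, p):
        self.index[p] = self.ctr.fetch_and_increment() % self.n   # line 1
        return "OK"                                                # line 2

    def is_head(self, p):
        i = self.index[p]
        self.proc[i] = p                                           # line 3
        return self.stat[i].fetch_and_increment() == 1             # line 4

    def dequeue(self, p):
        i = self.index[p]
        nxt = (i + 1) % self.n
        self.stat[i].write(0)                                      # line 5
        if self.stat[nxt].fetch_and_increment() == 1:              # line 6
            return self.proc[nxt]                                  # line 7
        return -1                                                  # line 8
```

As with the abstract type, the sketch assumes that callers follow the access etiquette; it performs no checks and does not model the broken state.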
6.1 Proof of Correctness
We denote Implementation MQFI (shown in Figure 6) of type MutexQueue formally
as I_MQFI = (P,V,H) where P = {0..N − 1} and V consists of: the base objects
{Ctr, Stat[0..N − 1], Proc[0..N − 1]}, denoted subsequently as the set B, and a
target object M. Each H ∈ H is a two-level execution history where processes call
the procedures enqueue(), isHead() and dequeue() as explained in Section 4. For
each such procedure call, H records an invocation step on M for the corresponding
operation and, if the procedure call terminates, a matching response step on M with
a response equal to the value returned by the procedure call. Similarly, H contains
an atomic step for each operation that a process applies to one of the base objects
B.
Implementation I_MQFI simulates a MutexQueue object that can be used in
Algorithm GQME (Figure 4) provided that it is linearizable with respect to the
MutexQueue type, and that each call to an access procedure incurs O(1) steps. The
latter property follows easily from the structure of the access procedures and also
implies O(1) RMR complexity. Therefore, we focus on linearizability. Specifically,
we must show that for every H ∈ H, H|M is linearizable with respect to the
MutexQueue type. To that end, we will explicitly construct a candidate linearization
H̄ of H|M, and prove that M conforms to the MutexQueue type in H̄. We will do this
using an invariant that relates the state of the base objects to the “linearized state”
of M, which is determined by our candidate linearization.
Given H ∈ H, we construct H¯ as follows. Recall that in Figure 6, the access
procedure for each operation of type MutexQueue contains one or more accesses
to base objects. For each MutexQueue operation execution in H, we define one
of these base object accesses as the linearization point of that operation execution.
Intuitively (and as we will prove in Theorem 6.3), the order of the linearization
points determines the order in which the MutexQueue operations that contain them
are linearized. Specifically, the linearization point of
• an enqueue() operation execution is the base object step Ctr.F&I() at line 1;
• an isHead() operation execution is the base object step Stat[index].F&I() at
line 4;
• a dequeue() operation execution in which Stat[(index + 1) mod N ].F&I() at
line 6 returns 1 is the base object step Proc[(index + 1) mod N ].read() at
line 7; and
• a dequeue() operation execution in which Stat[(index + 1) mod N ].F&I() at
line 6 returns a value other than 1 is that base object step itself.
Note that the response of a MutexQueue operation execution is uniquely determined
if its linearization point has occurred. For enqueue(), the response is always OK.
For isHead(), the response is true if and only if the linearization point’s response
is 1. For dequeue(), the response is the response of the linearization point, if the
F&I() at line 6 returns 1; and −1, otherwise.
For any H ∈ H, let H̄ denote the complete sequential history over M defined
below, based on the linearization points present in H:
• H̄ contains each operation execution invoked in H|M whose linearization point
appears in H, with the response determined by this linearization point, and
no other steps.
• Operation executions in H̄ occur in the same order as the corresponding
linearization points in H.
Note that, by definition, H̄ is a history over the target object M, so we can use
the notation succ(M^H̄, p) and pred(M^H̄, p) defined in Section 5.1. We also make
extensive use of the following notation: index_p^H is the last value read from Ctr by p
in H, reduced mod N, or ⊥ if H|p|Ctr = 〈〉. Informally, index_p^H denotes the value
of the private variable index of process p at the end of H, assuming that index is updated
atomically with the response of Ctr.F&I() at line 1 of enqueue().
Observation 6.1. For any G,H ∈ H such that G ⪯ H, Ḡ ⪯ H̄.
Informally, the following lemma says that two processes currently in the queue
cannot be assigned the same array index.
Lemma 6.2. For any H ∈ H and for any p, q ∈ P suppose that H̄ ∈ Lin(H|M),
M^H̄ ≠ ⊥, p ∈ QProcs(M^H̄), q ∈ QProcs(M^H̄), and index_p^H = index_q^H. Then
p = q.
Proof. Suppose for contradiction that p ≠ q. Without loss of generality, assume
that p’s last enqueue() in H̄ precedes q’s last enqueue(). Since the two indices
agree modulo N, q is the k’th process enqueued after p in H̄ for some k = mN and
some m ≥ 1. Let (Q, V) = M^H̄. It follows that |Q| ≥ k + 1 (i.e., Q contains at
least p and a chain of k successors up to and including q). Since m ≥ 1 it follows
that k + 1 > N, so by the pigeonhole principle Q contains two instances of some
element, which contradicts Observation 5.1 (b).
Next, define a bad MutexQueue operation execution as one that violates the
access etiquette for MutexQueue. More precisely, if H ∈ H then a MutexQueue
operation execution oe by process p in H is bad if and only if there exists a prefix
G of H that contains the invocation of oe but not its linearization point, such that
Ḡ ∈ Lin(G|M), M^Ḡ ≠ ⊥, and one of the following holds:
• oe is enqueue() and p ∈ QProcs(M^Ḡ)
• oe is isHead() and either p ∉ QProcs(M^Ḡ) or p ∈ VisProcs(M^Ḡ)
• oe is dequeue() and either p ≠ head(M^Ḡ) or p ∉ VisProcs(M^Ḡ)
The following theorem establishes the correctness of Implementation MQFI.
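Theorem 6.3. For any H ∈ H, H|M is linearizable with respect to type MutexQueue.
Proof. We will prove by induction on |H| the following claim:
If H does not contain any bad operation executions then H̄ ∈ Lin(H|M),
M^H̄ ≠ ⊥, and the values of the elements of Stat at the end of H are as
follows:
\[
\mathit{Stat}[i]^H =
\begin{cases}
2 & \text{if } \exists p \in P : \mathit{index}_p^H = i \wedge p = \mathit{head}(M^{\bar{H}}) \wedge p \in \mathit{VisProcs}(M^{\bar{H}}) \\
  & \quad \wedge \text{ in } H,\ p \text{ has not written } \mathit{Stat}[i] \text{ at line 5 of dequeue() since last invoking enqueue()} \\
2 & \text{if } \exists p \in P : \mathit{index}_p^H = i \wedge p \neq \mathit{head}(M^{\bar{H}}) \wedge p \in \mathit{VisProcs}(M^{\bar{H}}) \\
  & \quad \wedge\ \mathit{pred}(M^{\bar{H}}, p) \neq \bot \wedge \mathit{pred}(M^{\bar{H}}, p) \text{ is between lines 6 and 7 of dequeue() in } H \\
1 & \text{if } \exists p \in P : \mathit{index}_p^H = i \wedge p \neq \mathit{head}(M^{\bar{H}}) \wedge p \in \mathit{VisProcs}(M^{\bar{H}}) \\
  & \quad \wedge\ (\mathit{pred}(M^{\bar{H}}, p) = \bot \vee \mathit{pred}(M^{\bar{H}}, p) \neq \bot \wedge \mathit{pred}(M^{\bar{H}}, p) \text{ is not between lines 6 and 7 of dequeue() in } H) \\
1 & \text{if } \exists p \in P : \mathit{index}_p^H = i \wedge p = \mathit{head}(M^{\bar{H}}) \wedge p \notin \mathit{VisProcs}(M^{\bar{H}}) \\
1 & \text{if } \mathit{empty}(M^{\bar{H}}) \wedge i = \mathit{Ctr}^H \bmod N \\
0 & \text{otherwise}
\end{cases}
\]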
Informally, the above statement means the following. When p enqueues itself,
Stat[index_p^H] is 1 if p is the head of M and 0 otherwise. Stat[index_p^H] is subsequently
incremented once when p becomes visible, and once when p becomes the head (or is
about to become the head and its predecessor has partially completed dequeue()).
The latter two operations may happen in either order. Finally, Stat[index_p^H] returns
to 0 once p is visible, is the head of M and has begun dequeuing itself (i.e., executed
line 5). Furthermore, when M is empty, Stat[i] = 1 if i is the array index that will
be assigned to the next process that enqueues itself, and Stat[i] = 0 otherwise.
Note that, by Lemma 6.2, for every i ∈ [0..N − 1], there is at most one p ∈
QProcs(M^H̄) such that i = index_p^H.
In the remainder of the proof we denote the predicate that Stat[i]^H has the value
specified above by β(H, i).
Basis: |H| = 0. It follows that H = H̄ = 〈〉, so certainly H̄ ∈ Lin(H|M).
Moreover, empty(M^H̄) holds, so β(H, i) follows from the initialization of
Implementation MQFI, for all i ∈ [0..N − 1].
Induction Hypothesis: For any l > 0, assume that Theorem 6.3 holds for every
H such that |H| < l.
Induction Step: We must prove Theorem 6.3 for every H such that |H| = l. Let
G be a prefix of H of length l − 1. We proceed by cases on the last step σ in H.
Cases A–G are when H ends with an atomic base object step and Case H is when H
ends with a non-atomic step on the target object M. In all these cases we assume
that H does not contain a bad MutexQueue operation execution. Finally, Case I is
when H does contain a bad MutexQueue operation execution.
Case A: step σ is a Ctr.F&I() (see line 1 of enqueue()). In this case,
H̄ = Ḡ ◦ 〈(INV, p, M, enqueue()), (RES, p, M, OK)〉
and p ∉ QProcs(M^Ḡ), since H does not contain a bad operation execution. Then
certainly H̄ ∈ Lin(H|M), and M^H̄ ≠ ⊥. Furthermore, p ∈ QProcs(M^H̄), and
p ∉ VisProcs(M^H̄). Next, note that Ctr^H = Ctr^G + 1 and Stat[i]^H = Stat[i]^G for
all i ∈ [0..N − 1]. Let j = index_p^H (i.e., j = Ctr^G mod N). It remains to show
β(H, i) for all i ∈ [0..N − 1]. For i ≠ j it follows from the IH that Stat[i]^H has the
value stipulated by β(H, i). Finally, consider Stat[j]^H.
Subcase A-i: empty(M^Ḡ). Then p = head(M^H̄) and p ∉ VisProcs(M^H̄), so we
must show that Stat[j]^H = 1 (see fourth clause in the definition of Stat[j]^H). But
this follows from Stat[j]^H = Stat[j]^G and β(G, j) (fifth clause), as wanted.
Subcase A-ii: ¬empty(M^Ḡ). Then p ≠ head(M^H̄) and p ∉ VisProcs(M^H̄), so
we must show that Stat[j]^H = 0 (see sixth clause in the definition of Stat[j]^H). By
Lemma 6.2, there is no q ∈ QProcs(M^Ḡ) such that q ≠ p and index_q^H = j, so
Stat[j]^H = 0 follows from Stat[j]^H = Stat[j]^G and β(G, j) (sixth clause), as wanted.
Case B: step σ is a Proc[indexGp ].write(p) (see line 3 of isHead()). In this case,
H¯ = G¯, so M H¯ = M G¯ and M H¯ 6= ⊥ since M G¯ 6= ⊥ by the IH. Furthermore,
G|M = H|M , so H¯ ∈ Lin(H|M) since G¯ ∈ Lin(G|M) by the IH.
Case C: step σ is a Stat[indexGp ].F&I() with response r for some r (see line 4 of
isHead()). In this case,
H¯ = G¯ ◦ 〈(INV, p,M, isHead()), (RES, p,M, ret)〉
where ret = true if r = 1 and ret = false otherwise. Furthermore, $p \in \mathrm{QProcs}(M^{\bar G})$
and $p \notin \mathrm{VisProcs}(M^{\bar G})$ since H does not contain a bad operation execution. Let
$j = \mathit{index}_p^H$. Then by β(G, j), r = 1 if $p = \mathrm{head}(M^{\bar G})$ and r = 0 otherwise, so it
follows that $\bar H \in \mathrm{Lin}(H|M)$ and $M^{\bar H} \neq \bot$. Furthermore, $p \in \mathrm{QProcs}(M^{\bar H})$ and
$p \in \mathrm{VisProcs}(M^{\bar H})$ hold. It remains to prove β(H, i) for $i \in [0..N-1]$. For $i \neq j$,
we have $\mathit{Stat}[i]^H = \mathit{Stat}[i]^G$, and β(G, i) implies β(H, i). Finally, consider $\mathit{Stat}[j]^H$.
Note that Stat[j]H = Stat[j]G+1, by the effect of the operation under consideration
in this case.
Subcase C-i: p = head(M G¯). Then p = head(M H¯ ), and we must show Stat[j]H =
2 since p ∈ VisProcs(M H¯) (see first clause in definition of Stat[j]H), i.e., we must
show that Stat[j]G = 1. But this follows from β(G, j) (fourth clause).
Subcase C-ii: p 6= head(M G¯). Then p 6= head(M H¯ ), and we must show that
Stat[j]H ∈ {1, 2} since p ∈ VisProcs(M H¯), (see second and third clause in the defi-
nition of Stat[j]H). Since p 6= head(M G¯), p ∈ QProcs(M G¯) (so M G¯ is not empty),
and p 6∈ VisProcs(M G¯), β(G, j) implies that Stat[j]G = 0, hence Stat[j]H = 1, as
wanted.
Case D: step σ is a Stat[indexGp ].write(0) (see line 5 of dequeue()). In this case,
H¯ = G¯; thus H¯ ∈ Lin(H|M) (since, by the IH, G¯ = H¯ is a linearization of G|M =
H|M), and M H¯ 6= ⊥ (since M G¯ 6= ⊥ by the IH). Furthermore, p ∈ QProcs(M G¯),
p ∈ VisProcs(M G¯) and p = head(M G¯) since H does not contain a bad operation
execution, hence p ∈ QProcs(M H¯), p ∈ VisProcs(M H¯) and p = head(M H¯ ). It
remains to prove β(H, i) for all i ∈ [0..N − 1]. Let $j = \mathit{index}_p^H$ and note that for all
i ∈ [0..N − 1], i 6= j, β(H, i) follows directly from β(G, i). Finally, β(H, j) holds
since p = head(M H¯ ), p ∈ VisProcs(M H¯), and Stat[j]H = 0 by the effect of step σ.
(See the sixth clause in the definition of Stat[j]H , noting that, at the end of H, p
has just completed line 5.)
Case E: step σ is a Stat[(indexGp +1) mod N ].F&I() that returns ret 6= 1 (see line 6
of dequeue()). In this case,
H¯ = G¯ ◦ 〈(INV, p,M, dequeue()), (RES, p,M,−1)〉 .
Since, by assumption, H contains no bad operation executions, p ∈ QProcs(M G¯),
p ∈ VisProcs(M G¯) and p = head(M G¯). By the IH, M G¯ 6= ⊥ and so M H¯ 6= ⊥.
Let $j = \mathit{index}_p^H$, $k = (j + 1) \bmod N$, and $q = \mathrm{succ}(M^{\bar G}, p)$. Thus, if $q \neq \bot$ then
$q = \mathrm{head}(M^{\bar H})$. Furthermore, we claim that if $q \neq \bot$ then $q \notin \mathrm{VisProcs}(M^{\bar G})$. For,
if not, β(G, k) (third clause) would imply that $\mathit{Stat}[k]^G = 1$, which would contradict
the hypothesis of the case, specifically that $\mathit{ret} \neq 1$. Recall from the transition
function of MutexQueue that a dequeue() operation applied to a state in which
the head of the queue has no successor or has a successor that is not visible returns
−1. Thus, H¯ ∈ Lin(H|M), as wanted. It remains to show that β(H, i) holds for all
i ∈ [0..N − 1]. This follows immediately by the IH β(G, i) for all i 6= j, k.
To see that β(H, j) holds, we must prove that $\mathit{Stat}[j]^H = 0$. (This is because
$j = \mathit{index}_p^H$, $p \neq \mathrm{head}(M^{\bar H})$ and $p \notin \mathrm{VisProcs}(M^{\bar H})$, so clause six applies in
the definition of $\mathit{Stat}[j]^H$. We assume here that N > 1, so if $\mathrm{empty}(M^{\bar H})$ then
$j \neq \mathit{Ctr}^H \bmod N$ since $k = \mathit{Ctr}^H \bmod N$ and $j \neq k$. The case N = 1 is easy
to show, noting that j = k.) Since $\mathit{Stat}[j]^H = \mathit{Stat}[j]^G$, it suffices to prove that
$\mathit{Stat}[j]^G = 0$. Observing that $j = \mathit{index}_p^G$, $p = \mathrm{head}(M^{\bar G})$, $p \in \mathrm{VisProcs}(M^{\bar G})$ and
in G, p has executed line 5 of dequeue() since its last invocation of enqueue(), we
conclude (see clause six in the definition of $\mathit{Stat}[j]^G$) that $\mathit{Stat}[j]^G = 0$, as wanted.
Finally, to see that β(H, k) holds, we consider two cases.
Subcase E-i: q = ⊥. In this case, empty(M H¯) and k = CtrH mod N . Thus, we
must prove that Stat[k]H = 1 (see fifth clause in the definition of Stat[k]H). By the
IH, Stat[k]G = 0 (see sixth clause in the definition of Stat[k]G). By the effect of
step σ, Stat[k]H = Stat[k]G + 1. Thus, Stat[k]H = 1, as wanted.
Subcase E-ii: q 6= ⊥. As argued above, in this case q 6∈ VisProcs(M G¯), hence
q 6∈ VisProcs(M H¯). Furthermore, q = head(M H¯ ). Thus, we must prove that
Stat[k]H = 1 (see fourth clause in the definition of Stat[k]H). We also have q 6=
head(M G¯) (because p = head(M G¯) and q, p’s successor, cannot be the same as p
by Observation 5.1 (b)). Thus, by the IH, Stat[k]G = 0 (see sixth clause in the
definition of Stat[k]G). By the effect of step σ, Stat[k]H = Stat[k]G + 1. Thus,
Stat[k]H = 1, as wanted.
Case F: step σ is a Stat[(indexGp +1) mod N ].F&I() with return value 1 (see line 6
of dequeue()). In this case, H¯ = G¯; thus H¯ ∈ Lin(H|M) (since, by the IH,
G¯ = H¯ is a linearization of G|M = H|M), and M H¯ 6= ⊥ (since M G¯ 6= ⊥ by
the IH). Since H does not contain a bad operation execution, p ∈ QProcs(M G¯),
$p \in \mathrm{VisProcs}(M^{\bar G})$ and $p = \mathrm{head}(M^{\bar G})$. Let $j = \mathit{index}_p^H$, $k = (j + 1) \bmod N$, and
$q = \mathrm{succ}(M^{\bar G}, p)$. It remains to prove that β(H, i) holds for all $i \in [0..N-1]$. This
follows immediately by the IH β(G, i) for all $i \neq j, k$. The argument proving that
β(H, j) holds is exactly as in Case E. Finally, consider β(H, k). Since $k = \mathit{index}_q^G$,
$q = \mathrm{succ}(M^{\bar G}, p)$, $q \neq \mathrm{head}(M^{\bar G})$ (by Observation 5.1 (b)), and $\mathit{Stat}[k]^G = 1$ by the
hypothesis of this case, it follows by the IH β(G, k) that $q \in \mathrm{VisProcs}(M^{\bar G})$ (see
clause three of the definition of $\mathit{Stat}[k]^G$). Furthermore, since H does not contain
any bad operation executions by the IH, q is not executing a pending dequeue() in
G, and has not yet reached line 5 since last invoking enqueue(). Thus, $k = \mathit{index}_q^H$,
$q \in \mathrm{VisProcs}(M^{\bar H})$, $\mathit{Stat}[k]^H = \mathit{Stat}[k]^G + 1 = 2$ and $q = \mathrm{head}(M^{\bar H})$ by the effect
of step σ, so β(H, k) holds (see first clause in the definition of $\mathit{Stat}[k]^H$).
Case G: step σ is a Proc[(indexGp + 1) mod N ].read() that returns ret for some
ret (see line 7 of dequeue()). In this case,
H¯ = G¯ ◦ 〈(INV, p,M, dequeue()), (RES, p,M, ret)〉 .
Let $j = \mathit{index}_p^H$, $k = (j + 1) \bmod N$, and $q = \mathrm{succ}(M^{\bar G}, p)$. Note that
$p \in \mathrm{QProcs}(M^{\bar G})$, $p \in \mathrm{VisProcs}(M^{\bar G})$ and $p = \mathrm{head}(M^{\bar G})$ as in Case E, so
$\bar H \in \mathrm{Lin}(H|M)$ provided that ret = q. Also note that $q \neq \bot$, since if $F \preceq G$ where F ends
just before p’s last F&I() operation (i.e., line 6 of dequeue()) then $\mathrm{succ}(M^{\bar F}, p) \neq \bot$
follows from the arguments in Case F, and $\mathrm{succ}(M^{\bar F}, p) = \mathrm{succ}(M^{\bar G}, p)$. Similarly, it
follows that $q \in \mathrm{VisProcs}(M^{\bar G})$ and that q has not begun executing dequeue() by
the end of G. From Lemma 6.2 and Implementation MQFI, it follows that no process
has overwritten Proc[k] since q last wrote it, so Proc[k] = q, and ret = q, which
implies that $\bar H \in \mathrm{Lin}(H|M)$, as wanted. Now, β(H, i) for $i \in [0..N-1]$, $i \neq k$ follows
directly from β(G, i). Finally, β(G, k) implies that $\mathit{Stat}[k]^G = 2$ since $q \neq \mathrm{head}(M^{\bar G})$,
$q \in \mathrm{VisProcs}(M^{\bar G})$, and $\mathrm{pred}(M^{\bar G}, q) = p$ is between lines 6 and 7 at the end of
G (see second clause in definition of $\mathit{Stat}[k]^G$). Since $\mathit{Stat}[k]^H = \mathit{Stat}[k]^G = 2$,
$q = \mathrm{head}(M^{\bar H})$, $q \in \mathrm{VisProcs}(M^{\bar H})$, and q has not started dequeue() by the end of
H, it follows that β(H, k) holds (see first clause in definition of $\mathit{Stat}[k]^H$).
Case H: step σ is a non-atomic step on the target object M by process p.
Subcase H-i: σ is an invocation step. Then H¯ = G¯ by definition since the lin-
earization point of every MutexQueue operation occurs after the initial invocation
step. Furthermore, G¯ ∈ Lin(H|M) since G¯ ∈ Lin(G|M) and H = G ◦ 〈sI〉 where sI
is an invocation. Thus, H¯ ∈ Lin(H|M), and M H¯ 6= ⊥ since M G¯ 6= ⊥ by the IH.
Subcase H-ii: σ is a response step. Then the linearization point of the operation
execution corresponding to σ has occurred in G, and so G¯ contains this operation
execution. Since G¯ ∈ Lin(G|M) by the IH, it follows that G¯ ∈ Lin(H|M) provided
that σ and the last step in H¯|p have equal return values. But the latter follows from
our construction of H¯. (Recall that for an operation execution that is pending in H,
if the linearization point has occurred then the operation execution is completed with
a matching response step in H¯ that returns the uniquely-determined return value of
the access procedure.) Similarly, it follows that H¯ = G¯. Thus, H¯ ∈ Lin(H|M) and
M H¯ 6= ⊥ since G¯ ∈ Lin(G|M) and M G¯ 6= ⊥ by the IH.
Case I: H contains a bad MutexQueue operation execution. Let F be the prefix of H up
to but not including the first invocation step $\sigma_I$ of a bad MutexQueue operation
execution. By the IH, $\bar F \in \mathrm{Lin}(F|M)$ and $M^{\bar F} \neq \bot$. To obtain a linearization
of H|M, first let $L = \bar F \circ \langle \sigma_I, \sigma_R \rangle$ where $\sigma_R$ is a response matching $\sigma_I$, with an
arbitrary return value. Since $\sigma_I$ corresponds to a bad operation execution, it follows
that $L \in \mathrm{Lin}((F \circ \langle \sigma_I \rangle)|M)$, and that $M^L = \bot$. Finally, form L′ by appending to
L a complete operation execution on M for all remaining operation executions in
H|M (i.e., those that have been invoked but are not present in L), say in the order
of their invocation steps in H. Once again assign the return value for each such
operation execution arbitrarily. Since $M^L = \bot$, it follows that $L' \in \mathrm{Lin}(H|M)$.
6.1.1 RMR Complexity
Each access procedure of Implementation MQFI performs O(1) steps since there are
no loops. In particular, the RMR complexity of each access procedure is O(1).
6.1.2 Bounded Memory Implementation
A drawback of the above implementation is that Ctr grows without bound. We
now discuss how to implement Ctr using bounded memory. One approach, used by
[3], is to atomically subtract N from Ctr whenever N − 1 is fetched from the F&I()
at line 1 of enqueue(). This ensures that Ctr never grows beyond 2N − 1 (since
at most N − 1 other processes can increment Ctr before N is subtracted). The
drawback of this solution is that a fetch-and-add primitive is needed in addition to
(or in place of) fetch-and-increment. Another solution, brought to our attention by
Prasad Jayanti, is to allow Ctr to overflow, provided that it returns to zero without
halting the execution. In particular, if Ctr is an unsigned m-bit integer and N
divides $2^m$, then Implementation MQFI remains correct, because the values
assigned to index are the same as before.
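The divisibility condition in the overflow approach can be checked numerically. The sketch below is illustrative only: the helper names wrapped_indices and unbounded_indices are hypothetical, and the m-bit counter is modelled by masking a Python integer. It compares the sequence of array indices produced by a wrapping counter with the sequence produced by an unbounded one.

```python
def wrapped_indices(n, m, count):
    """Indices assigned by the first `count` enqueues when Ctr has m bits."""
    mask = (1 << m) - 1
    ctr = 0
    out = []
    for _ in range(count):
        out.append(ctr % n)        # index := Ctr.F&I() mod N
        ctr = (ctr + 1) & mask     # unsigned m-bit increment wraps to 0
    return out


def unbounded_indices(n, count):
    """Indices assigned when Ctr never overflows."""
    return [c % n for c in range(count)]


# N = 4 divides 2^4, so wrapping is invisible to the algorithm.
same = wrapped_indices(4, 4, 40) == unbounded_indices(4, 40)
# N = 3 does not divide 2^4: the sequences diverge at the wrap-around.
differs = wrapped_indices(3, 4, 40) != unbounded_indices(3, 40)
around_wrap = wrapped_indices(3, 4, 17)[15:17]
```

When N does not divide $2^m$, two consecutive enqueues around the wrap-around receive the same index, which breaks the round-robin assignment that the proof relies on.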
7 Wait-free Implementation of MutexQueue Using Fetch-and-Store
The implementation of an N -process MutexQueue presented in Figure 7 is based on
the mutual exclusion algorithm of Craig [8, 7], in particular a variant brought to our
attention by Prasad Jayanti. It relies on a shared object supporting a fetch-and-
store (F&S) operation, which atomically writes a variable and returns its previous
value. Without loss of generality, we assume that such an object also supports an
ordinary write operation. (One can always simulate a write by applying a F&S and
ignoring the response.)
Informally, Implementation MQFS (Figure 7) works as follows. At each point
in time each process p “owns” exclusively an index myIdx p of array Queue; the
index owned by p changes each time the process dequeues itself (see line 10). For
this reason Queue has N + 1 entries; if a process dequeues itself at a time when all
others are enqueued, it needs to acquire an index different from those owned by the
other processes and from the index it previously owned.
The processes currently in the queue implicitly form a list, the first element of
which is the head of the queue. The shared variable Last contains the index owned
by the last process in the queue. (Whenever the queue is empty, Last contains an
index not currently owned by any process.) When process p enqueues itself it uses
F&S on Last to find out its predecessor’s index (which p records in prevIdx p) and to
atomically swap its own index into Last (see line 2). The use of F&S to atomically
read and update Last ensures the integrity of the list of processes waiting in the
queue; it is not possible for two processes getting enqueued concurrently to consider
the same process as their predecessor.
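The role of the atomic swap on Last can be illustrated with threads. In the minimal sketch below, SwapCell is a hypothetical stand-in for an F&S object (a lock plays the part of hardware atomicity), and process IDs double as the indices swapped into Last, which simplifies the scheme in the text where each process swaps in its current myIdx.

```python
import threading


class SwapCell:
    """Models a fetch-and-store object; the lock stands in for atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_store(self, new):
        with self._lock:
            old, self._value = self._value, new
            return old


def enqueue_all(n):
    """Let n threads swap their ID into Last; return each one's predecessor."""
    last = SwapCell(n)          # initial value N is owned by no process
    prev = [None] * n

    def worker(p):
        prev[p] = last.fetch_and_store(p)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return prev


prev = enqueue_all(8)
```

Whatever the interleaving, every thread obtains a different predecessor value and exactly one of them sees the initial value, so the implicit list is never forked.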
Recall from the specification of MutexQueue that the operation isHead() has
two objectives: (a) to determine whether the process p executing the operation is
the head of the queue, and (b) to make p visible to its predecessor, thereby ensuring
that when the predecessor dequeues itself, it will “wake up” p. In addressing the
second objective we must contend with the possibility of p becoming visible to its
predecessor just as that predecessor is dequeueing itself. This race condition is
handled by appropriate use of F&S. We now explain how the implementation of
MutexQueue achieves these two objectives.
When process p enqueues itself, it sets $\mathit{Queue}[\mathit{myIdx}_p] = (\mathit{myIdx}_p, p)$ (see line 1).
When p dequeues itself, it sets $\mathit{Queue}[\mathit{myIdx}_p]$ to a value different from $(\mathit{myIdx}_p, -)$,
specifically to $(\mathit{prevIdx}_p, p)$ (see line 6).$^5$ (We use F&S for this assignment because
of the race condition mentioned above, as we will explain shortly.)
When it executes operation isHead(), process p signals its predecessor that it
has become visible by swapping the index it owns, $\mathit{myIdx}_p$, and its ID, into the
predecessor’s position of array Queue, namely $\mathit{Queue}[\mathit{prevIdx}_p]$; it records the old value
of $\mathit{Queue}[\mathit{prevIdx}_p]$ in $\mathit{tempIdx}_p$ and $\mathit{tempId}_p$ (see line 4). With this information,
p can determine if it is the head of the queue: this is the case if and only if its
predecessor had dequeued itself by the time p signalled that it is visible, i.e., if and
only if $\mathit{tempIdx}_p \neq \mathit{prevIdx}_p$ (see line 5).
$^5$ In this context “−” denotes a wildcard value.
Finally, we explain how a process p that is dequeuing itself ensures that it “wakes
up” its successor, provided that the latter is visible. As we have seen, when p
dequeues itself, it swaps $(\mathit{prevIdx}_p, p)$ (where $\mathit{prevIdx}_p \neq \mathit{myIdx}_p$) into $\mathit{Queue}[\mathit{myIdx}_p]$,
and records the old value of $\mathit{Queue}[\mathit{myIdx}_p]$ into $\mathit{tempIdx}_p$ and $\mathit{tempId}_p$ (see line 6).
There are two cases, depending on the value of $\mathit{tempIdx}_p$.
1. Process p finds that $\mathit{tempIdx}_p \neq \mathit{myIdx}_p$. In this case, p’s successor q must have
executed line 4 and swapped $(\mathit{myIdx}_q, q)$ (where $\mathit{myIdx}_q \neq \mathit{myIdx}_p$ since no
two processes can own the same index at the same time) into $\mathit{Queue}[\mathit{prevIdx}_q]$,
i.e., into $\mathit{Queue}[\mathit{myIdx}_p]$ (since p is q’s predecessor). This means that q became
visible before p dequeued itself, and so p is in charge of waking up q when it
is dequeued. Indeed, in this case, p’s call to dequeue() returns q’s ID (see line 8).
2. Process p finds that $\mathit{tempIdx}_p = \mathit{myIdx}_p$. In this case, $\mathit{Queue}[\mathit{myIdx}_p]$ has not
been changed by p’s successor since the time when p enqueued itself and wrote
$(\mathit{myIdx}_p, p)$ into $\mathit{Queue}[\mathit{myIdx}_p]$ (see line 1). This means that the successor of p
is not yet visible and so p is not responsible for waking it up. Accordingly, in
this case p’s dequeue() operation returns −1 (see line 9).
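The two outcomes of the race between isHead() and dequeue() can be traced with a sequential Python sketch of Figure 7. It is a model under assumptions rather than a definitive implementation: each method runs atomically, F&S is simulated by a read-then-write helper, and the class name MQFSSketch and the concrete initial pairs ((i + 1) mod (N + 1), −1) are illustrative choices that merely satisfy the requirement that Queue[i] not initially have the form (i, −).

```python
class MQFSSketch:
    """Sequential model of the fetch-and-store MutexQueue of Figure 7."""

    def __init__(self, n):
        # N + 1 slots; slot i starts with a first component different from i.
        self.queue = [((i + 1) % (n + 1), -1) for i in range(n + 1)]
        self.last = n
        self.my_idx = list(range(n))   # myIdx_p, initially p
        self.prev_idx = [None] * n     # prevIdx_p, uninitialized

    def _fas(self, i, value):
        """Simulated F&S on Queue[i]: store `value`, return the old pair."""
        old, self.queue[i] = self.queue[i], value
        return old

    def enqueue(self, p):
        self.queue[self.my_idx[p]] = (self.my_idx[p], p)          # line 1
        self.prev_idx[p], self.last = self.last, self.my_idx[p]   # line 2
        return "OK"                                               # line 3

    def is_head(self, p):
        temp_idx, _ = self._fas(self.prev_idx[p], (self.my_idx[p], p))  # line 4
        return temp_idx != self.prev_idx[p]                             # line 5

    def dequeue(self, p):
        temp_idx, temp_id = self._fas(self.my_idx[p], (self.prev_idx[p], p))  # line 6
        ret = temp_id if temp_idx != self.my_idx[p] else -1   # lines 7-9
        self.my_idx[p] = self.prev_idx[p]                     # line 10
        return ret                                            # line 11


# Case 1: the successor becomes visible before the head dequeues itself.
a = MQFSSketch(2)
a.enqueue(0)
a_head0 = a.is_head(0)
a.enqueue(1)
a_head1 = a.is_head(1)     # p1 signals p0 through Queue[prevIdx_1]
a_ret = a.dequeue(0)       # p0 finds p1's pair and returns p1's ID

# Case 2: the head dequeues itself before the successor becomes visible.
b = MQFSSketch(2)
b.enqueue(0)
b.is_head(0)
b.enqueue(1)
b_ret = b.dequeue(0)       # slot unchanged since line 1, so -1
b_head1 = b.is_head(1)     # p1 now finds a pair that is not (prevIdx_1, -)
```

In the first trace the dequeuing process hands over to its successor; in the second the successor discovers on its own that it has become the head.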
7.1 Proof of Correctness
We proceed using the same approach as in Section 6. We denote Implementation MQFS
(shown in Figure 7) of type MutexQueue formally as $I_{MQFS} = (\mathcal{P}, \mathcal{V}, \mathcal{H})$
where $\mathcal{P} = \{0..N-1\}$ and $\mathcal{V}$ consists of: the base objects {Last, Queue[0..N]},
denoted subsequently as the set B, and a target object M. Histories in $\mathcal{H}$ model
the execution of Implementation MQFS in a sense analogous to the one defined in
Section 6.1 for Implementation MQFI. As before, it follows easily that each call to
an access procedure incurs O(1) steps, and so we focus on linearizability. To that
end, we define for any $H \in \mathcal{H}$ a candidate linearization $\bar H$ using the same approach
as in Section 6.1. We also define bad operation executions exactly as in Section 6.1.
For the candidate linearization, we define the linearization point of
• an enqueue() operation execution to be the base object step Last.F&S(myIdx) at
line 2;
• an isHead() operation execution to be the base object step Queue[prevIdx].F&S
at line 4; and
• a dequeue() operation execution to be the base object step Queue[myIdx].F&S at
line 6.
Note that as in Section 6.1, the response of a MutexQueue operation execution
is determined uniquely if its linearization point has been reached. For enqueue(),
the response is always OK. For isHead(), the response is true if and only if the
first component of the linearization point’s response is different from the value of
prevIdx for the calling process. For dequeue(), the response is −1 if the F&S at
line 6 returns an ordered pair of the form (myIdx, −), and is the second element in
this ordered pair otherwise.

Shared variables:
    Queue: array [0..N] of pairs (integer 0..N, integer 0..N − 1), initially Queue[i] ≠ (i, −)
    Last: integer 0..N, initially N
Static private (per-process) variables:
    myIdx: integer 0..N, initially p for process p
    prevIdx: integer 0..N, uninitialized
    tempIdx: integer 0..N, uninitialized
    tempId: integer 0..N − 1, uninitialized
Procedure for operation enqueue() by process p:
    1   Queue[myIdx].write((myIdx, p))
    2   prevIdx := Last.F&S(myIdx)
    3   return OK
Procedure for operation isHead() by process p:
    4   (tempIdx, tempId) := Queue[prevIdx].F&S((myIdx, p))
    5   return tempIdx ≠ prevIdx
Procedure for operation dequeue() by process p:
    6   (tempIdx, tempId) := Queue[myIdx].F&S((prevIdx, p))
    7   if tempIdx ≠ myIdx then
    8       ret := tempId
        else
    9       ret := −1
        end
    10  myIdx := prevIdx
    11  return ret
Figure 7: Implementation MQFS (N-process MutexQueue implementation using
Fetch-and-Store).
In the proof of correctness of Implementation MQFS it will be useful to refer to
the values of private variables at the end of histories in $\mathcal{H}$. Let $H \in \mathcal{H}$ be a history
such that $\bar H \in \mathrm{Lin}(H|M)$$^6$ and $M^{\bar H} \neq \bot$. Let $v_p$ be a private variable of process p
(i.e., one of $\mathit{myIdx}_p$, $\mathit{prevIdx}_p$ or $\mathit{tempIdx}_p$). We use $v_p^H$ to denote the value of $v_p$
at the end of H, assuming that each assignment to a private variable of p occurs at
the same time as the response of the last base object step by p that precedes that
assignment in the execution corresponding to H. Below we also use the notion of
bad operation executions, defined exactly as in Section 6.1.
For any $H \in \mathcal{H}$, $p \in \mathcal{P}$ and $i \in [0..N]$, we say that p owns i at the end of H if
and only if $\mathit{myIdx}_p^H = i$.
We now state two observations in connection with the above definitions. Infor-
mally, these say that:
(a) The value of myIdx p after p performs a dequeue() operation is the value
of prevIdx p when p performed the preceding enqueue(). Intuitively, this is
because of line 10.
(b) The value of $\mathit{prevIdx}_p$ after p has enqueued itself is the value that $\mathit{myIdx}_q$ had
when q was last in the queue, where q is the process that entered the queue
just before p. Intuitively, this is because of line 2.
More formally, we have:
Observation 7.1. Let $H \in \mathcal{H}$ be a history where $\bar H \in \mathrm{Lin}(H|M)$ and $M^{\bar H} \neq \bot$,
and let $G \preceq H$ (note that $\bar G \preceq \bar H$).
(a) Let p be any process such that $p \in \mathrm{QProcs}(M^{\bar G})$ and p executes dequeue()
exactly once in $\bar H$ after $\bar G$. Then $\mathit{prevIdx}_p^G = \mathit{myIdx}_p^H$.
(b) Let p, q be any processes such that $q \in \mathrm{QProcs}(M^{\bar G})$, $p \in \mathrm{QProcs}(M^{\bar H})$, q
is the process that executes the last enqueue() preceding the last enqueue()
of p in $\bar H$, and q executes dequeue() at most once in $\bar H$ following $\bar G$. Then
$\mathit{myIdx}_q^G = \mathit{prevIdx}_p^H$.
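Lemma 7.2. Let $H \in \mathcal{H}$ be a history where $\bar H \in \mathrm{Lin}(H|M)$ and $M^{\bar H} \neq \bot$. Then
the following statements hold:
(1) $\forall x, y \in \mathcal{P},\ x \neq y \implies \mathit{myIdx}_x^H \neq \mathit{myIdx}_y^H$
(2) $\forall x, y \in \mathrm{QProcs}(M^{\bar H}),\ x \neq y \implies \mathit{prevIdx}_x^H \neq \mathit{prevIdx}_y^H$
(3) $\forall x \in \mathcal{P},\ y \in \mathrm{QProcs}(M^{\bar H})$, if $\mathit{myIdx}_x^H = \mathit{prevIdx}_y^H$ then $y = \mathrm{succ}(M^{\bar H}, x)$
(4) $\forall x \in \mathrm{QProcs}(M^{\bar H}),\ \mathit{myIdx}_x^H \neq \mathit{prevIdx}_x^H$
$^6$ Lin(H|M) is the set of linearizations of H|M, as defined in Section 4.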
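The four statements of Lemma 7.2 concern only the index variables, so they can be exercised with a model that tracks myIdx, prevIdx and Last alone. The following sketch is a sanity check under assumptions, not a substitute for the proof: operations are applied sequentially and only in legal order, isHead() is omitted because it changes no index, and the function name check_index_invariants is hypothetical.

```python
import random


def check_index_invariants(n, steps, seed=0):
    """Random legal sequential runs; asserts Lemma 7.2 (1)-(4) after each step."""
    rng = random.Random(seed)
    my_idx = list(range(n))      # myIdx_p, initially p
    prev_idx = [None] * n        # prevIdx_p, uninitialized
    last = n                     # Last, initially N
    queue = []                   # abstract queue state, head first
    for _ in range(steps):
        p = rng.randrange(n)
        if p not in queue:       # enqueue(): line 2 swaps myIdx_p into Last
            prev_idx[p], last = last, my_idx[p]
            queue.append(p)
        elif queue[0] == p:      # dequeue() by the head: line 10
            my_idx[p] = prev_idx[p]
            queue.pop(0)
        else:
            continue             # p is waiting; no index changes
        assert len(set(my_idx)) == n                             # (1)
        assert len({prev_idx[x] for x in queue}) == len(queue)   # (2)
        for y in queue:
            for x in range(n):
                if my_idx[x] == prev_idx[y]:                     # (3)
                    assert x in queue
                    assert queue.index(x) + 1 == queue.index(y)
            assert my_idx[y] != prev_idx[y]                      # (4)
    return True


ok = check_index_invariants(n=4, steps=5000)
```

A run that completes without an assertion failure has found no counterexample along the sampled schedules; it says nothing about concurrent interleavings, which the inductive proof covers.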
Proof. We proceed by induction on |H|. It suffices to prove (1)–(3) since (4) follows
immediately from (3): if $p \in \mathrm{QProcs}(M^{\bar H})$ and $\mathit{myIdx}_p^H = \mathit{prevIdx}_p^H$ then (3) implies
that $p = \mathrm{succ}(M^{\bar H}, p)$, which contradicts Observation 5.1.
Basis: |H| = 0. It follows that $\bar H = H = \langle\rangle$. By initialization, $\mathit{myIdx}_p^H = p$ and
$p \notin \mathrm{QProcs}(M^{\bar H})$ hold for every $p \in \mathcal{P}$, and so (1)–(3) hold for H.
Induction Hypothesis: For any l > 0, assume that Lemma 7.2 holds for all H
such that |H| < l.
Induction Step: We must prove Lemma 7.2 for every H such that |H| = l. Let G
be a prefix of H of length l − 1. We proceed by cases on the last step σ in H. Since
$M^{\bar H} \neq \bot$, it follows that $M^{\bar G} \neq \bot$.
Case A: $\bar G = \bar H$ or σ is the linearization point of isHead() (line 4). In this
case, for each $p \in \mathcal{P}$, $\mathit{myIdx}_p^G = \mathit{myIdx}_p^H$ and $\mathit{prevIdx}_p^G = \mathit{prevIdx}_p^H$. Moreover,
$\mathrm{QProcs}(M^{\bar G}) = \mathrm{QProcs}(M^{\bar H})$. Thus, the fact that the lemma holds for H follows
directly from the fact that (by the IH) it holds for G.
Case B: σ is the linearization point of M.enqueue() by process p. Lemma 7.2 (1)
for H follows directly from the IH since for every $x \in \mathcal{P}$, $\mathit{myIdx}_x^G = \mathit{myIdx}_x^H$. It
remains to prove parts (2) and (3) of the lemma for H.
Subcase B1: $p = \mathrm{head}(M^{\bar H})$. It follows that $M^{\bar G}$ is empty and $\mathrm{QProcs}(M^{\bar H})$
contains only p, and so Lemma 7.2 (2) holds trivially for H. Now let $j = \mathit{prevIdx}_p^H$.
To prove part (3), it suffices to show that no process $z \in \mathcal{P}$ owns j at the end
of H. Suppose for contradiction that for some $z \in \mathcal{P}$, $\mathit{myIdx}_z^H = j$. It follows
that $\bar H$ contains more than one M.enqueue(), otherwise j = N and $\mathit{myIdx}_z^H = z$
where $z \neq N$. Let r be the process that executes the last enqueue() preceding
the last enqueue() of p in $\bar H$. Let F be the prefix of H up to but not including
the linearization point of the last M.dequeue() performed by r; this is well-defined
because $M^{\bar G}$ is empty. By Observation 7.1 (b) and the fact that $j = \mathit{prevIdx}_p^H$,
it follows that $\mathit{myIdx}_r^F = j$. Also note that no process other than r applies a
MutexQueue operation execution in $\bar G$ after $\bar F$. There are two cases, each leading
to a contradiction.
• If $z \neq r$ then $\mathit{myIdx}_z^F = j$ since $\mathit{myIdx}_z^H = j$ and z does not execute dequeue()
in $\bar G$ after $\bar F$. At the same time $\mathit{myIdx}_r^F = j$, as argued above. But $\mathit{myIdx}_z^F = j$
and $\mathit{myIdx}_r^F = j$ contradict part (1) of the IH for F since $z \neq r$.
• If z = r then by Observation 7.1 (a) and the fact that $\mathit{myIdx}_z^H = j$, $\mathit{prevIdx}_z^F = j$
and hence $\mathit{prevIdx}_r^F = j$. At the same time, $\mathit{myIdx}_r^F = j$, as argued above.
Furthermore, $r \in \mathrm{QProcs}(M^{\bar F})$ by definition of r and F. But $\mathit{prevIdx}_r^F = j$,
$\mathit{myIdx}_r^F = j$ and $r \in \mathrm{QProcs}(M^{\bar F})$ contradict part (4) of the IH for F.
Thus, Lemma 7.2 (3) holds for H.
Subcase B2: $p \neq \mathrm{head}(M^{\bar H})$. Let $r = \mathrm{pred}(M^{\bar H}, p)$ and let $j = \mathit{prevIdx}_p^H$.
First, consider Lemma 7.2 (2) for H. For every $q \in \mathcal{P} \setminus \{p\}$, $\mathit{prevIdx}_q^G = \mathit{prevIdx}_q^H$
and $q \in \mathrm{QProcs}(M^{\bar G}) \Leftrightarrow q \in \mathrm{QProcs}(M^{\bar H})$ hold, so it suffices to show that there
is no $z \in \mathrm{QProcs}(M^{\bar G})$ such that $\mathit{prevIdx}_z^G = j$. Suppose for contradiction that
$\mathit{prevIdx}_z^G = j$ for some $z \in \mathrm{QProcs}(M^{\bar G})$. Observe that $r \in \mathrm{QProcs}(M^{\bar G})$ by the
definition of G and the hypothesis of Subcase B2, and that $\mathrm{succ}(M^{\bar G}, r) = \bot$ by the
definition of G and the hypothesis of Case B. Since $\mathit{prevIdx}_p^H = j$, it follows from the
definition of r and G and Observation 7.1 (b) that $\mathit{myIdx}_r^G = j$. Since $\mathit{prevIdx}_z^G = j$
and $z \in \mathrm{QProcs}(M^{\bar G})$ by assumption, part (3) of the IH for G implies that
$z = \mathrm{succ}(M^{\bar G}, r)$. But this contradicts the earlier observation that $\mathrm{succ}(M^{\bar G}, r) = \bot$.
Next, consider part (3) of the lemma. It suffices to show that for any $q \in \mathcal{P}$,
if $\mathit{myIdx}_q^H = j$ then $p = \mathrm{succ}(M^{\bar H}, q)$. By part (1) of the IH for G, r is the only
process that owns $\mathit{myIdx}_r^G$ at the end of G, and so by the hypothesis of Case B, r is
the only process that owns $\mathit{myIdx}_r^H$ at the end of H. By definition, $r = \mathrm{pred}(M^{\bar H}, p)$
and so $p = \mathrm{succ}(M^{\bar H}, r)$. Thus, Lemma 7.2 (3) holds for H.
Case C: σ is the linearization point of M.dequeue() by process p. Note that
$p = \mathrm{head}(M^{\bar G})$ since $M^{\bar H} \neq \bot$. Let $j = \mathit{prevIdx}_p^G$.
First, consider Lemma 7.2 (1) for H. Note that for every $q \in \mathcal{P} \setminus \{p\}$,
$\mathit{myIdx}_q^G = \mathit{myIdx}_q^H$ holds. Furthermore, $\mathit{myIdx}_p^H = j$ by line 10 and Observation 7.1 (a). It
suffices to show that no process other than p owns j at the end of H. Suppose for
contradiction that $\mathit{myIdx}_z^H = j$ for some $z \in \mathcal{P} \setminus \{p\}$. Then $\mathit{myIdx}_z^G = j$,
$\mathit{prevIdx}_p^G = j$ and $p \in \mathrm{QProcs}(M^{\bar G})$ all hold by definition of G and the hypothesis of
Case C, so by part (3) of the IH for G it follows that $p = \mathrm{succ}(M^{\bar G}, z)$. But this
contradicts the earlier observation that $p = \mathrm{head}(M^{\bar G})$.
Next, consider Lemma 7.2 (2) for H. Note that for every $q \in \mathcal{P} \setminus \{p\}$,
$\mathit{prevIdx}_q^G = \mathit{prevIdx}_q^H$ and $q \in \mathrm{QProcs}(M^{\bar G}) \Leftrightarrow q \in \mathrm{QProcs}(M^{\bar H})$ hold. Furthermore,
$p \notin \mathrm{QProcs}(M^{\bar H})$ by the hypothesis of Case C. Thus, Lemma 7.2 (2) for H follows
directly from part (2) of the IH for G.
Finally, consider part (3). Note that for every $q \in \mathcal{P} \setminus \{p\}$, the following all
hold: $\mathit{myIdx}_q^G = \mathit{myIdx}_q^H$, $\mathit{prevIdx}_q^G = \mathit{prevIdx}_q^H$ and
$q \in \mathrm{QProcs}(M^{\bar G}) \Leftrightarrow q \in \mathrm{QProcs}(M^{\bar H})$. Furthermore, $p \notin \mathrm{QProcs}(M^{\bar H})$ by the
hypothesis of Case C. Thus, by part (3) of the IH for G, it suffices to show that there
is no $z \in \mathrm{QProcs}(M^{\bar H})$ such that $\mathit{prevIdx}_z^H = \mathit{myIdx}_p^H$. Suppose for contradiction that
$\mathit{prevIdx}_z^H = \mathit{myIdx}_p^H$ for some $z \in \mathrm{QProcs}(M^{\bar H})$. Note that $z \neq p$ since
$p \notin \mathrm{QProcs}(M^{\bar H})$, and so by the hypothesis of Case C we further have
$\mathit{prevIdx}_z^G = \mathit{prevIdx}_z^H$ (hence $\mathit{prevIdx}_z^G = \mathit{myIdx}_p^H$) and $z \in \mathrm{QProcs}(M^{\bar G})$. At the
same time, by line 10, Observation 7.1 (a) and the hypothesis of Case C,
$\mathit{prevIdx}_p^G = \mathit{myIdx}_p^H$ and $p \in \mathrm{QProcs}(M^{\bar G})$ both hold. Thus, we have shown that the
following all hold: $\mathit{prevIdx}_z^G = \mathit{myIdx}_p^H$, $\mathit{prevIdx}_p^G = \mathit{myIdx}_p^H$,
$z, p \in \mathrm{QProcs}(M^{\bar G})$ and $z \neq p$. But this contradicts part (2) of the IH for G.
The following theorem establishes the correctness of Implementation MQFS.
Theorem 7.3. For any $H \in \mathcal{H}$, H|M is linearizable with respect to type MutexQueue.
Proof. We will prove by induction on |H| the following claim:
If H does not contain any bad operation executions then $\bar H \in \mathrm{Lin}(H|M)$,
$M^{\bar H} \neq \bot$, and the value of Queue at the end of H is as follows:
For any $i \in [0..N]$, if $\exists p \in \mathcal{P}$ such that at the end of H p owns index i,
and has applied Queue[myIdx].write at line 1 of enqueue(), but since
last doing so p has not applied Queue[myIdx].F&S at line 6 of dequeue(),
then (letting s denote $\mathrm{succ}(M^{\bar H}, p)$)
$$\mathit{Queue}[i]^H = \begin{cases} (i, p) & \text{if } s \notin \mathrm{VisProcs}(M^{\bar H}) \\ (\mathit{myIdx}_s^H, s) & \text{otherwise} \end{cases}$$
else $\mathit{Queue}[i]^H \neq (i, -)$.
In the remainder of the proof we denote by β(H, i) the predicate that at the end
of execution history H, $\mathit{Queue}[i]^H$ has the value specified above.
Basis: |H| = 0. It follows that $H = \bar H = \langle\rangle$, so certainly $\bar H \in \mathrm{Lin}(H|M)$. It
remains to show β(H, i) for $i \in [0..N]$, which in this case asserts that
$\mathit{Queue}[i]^H \neq (i, -)$. But this follows from the initialization of Implementation MQFS.
Induction Hypothesis: For any l > 0, assume that Theorem 7.3 holds for every
H such that |H| < l.
Induction Step: We must prove Theorem 7.3 for every H such that |H| = l. Let
G be a prefix of H of length l − 1. We proceed by cases on the last step σ in H.
Cases A–E are when H ends with an atomic base object step, and Case F is when H
ends with a non-atomic step on the target object M. In all these cases we assume
that H does not contain a bad MutexQueue operation execution. Finally, Case G is
when H does contain a bad MutexQueue operation execution.
Case A: σ is a $\mathit{Queue}[\mathit{myIdx}_p^G].\mathrm{write}((\mathit{myIdx}_p^G, p))$ (see line 1 of enqueue()). In
this case, $\bar H = \bar G$; thus $\bar H \in \mathrm{Lin}(H|M)$ (since, by the IH, $\bar G = \bar H$ is a linearization
of G|M = H|M), and $M^{\bar H} \neq \bot$ (since $M^{\bar G} \neq \bot$ by the IH).
To prove the theorem for H it suffices to verify that $\beta(H, \mathit{myIdx}_p^H)$ holds; all
other clauses either hold trivially (because their antecedents are false) or follow
immediately from the IH. (Note that in this case $\mathit{myIdx}_p^H$ is the only position in
Queue changed by σ, which does not change the linearized state of M.) Since p has
just completed line 1 at the end of H, $\beta(H, \mathit{myIdx}_p^H)$ asserts that
$\mathit{Queue}[\mathit{myIdx}_p^H]^H = (\mathit{myIdx}_p^H, p)$, which indeed holds by the action of step σ.
Case B: σ is a Last.F&S (see line 2 of enqueue()). In this case,
$\bar H = \bar G \circ \langle (\mathrm{INV}, p, M, \mathrm{enqueue}()), (\mathrm{RES}, p, M, \mathrm{OK}) \rangle$
and $p \notin \mathrm{QProcs}(M^{\bar G})$, since H does not contain a bad operation execution. Then
certainly $\bar H \in \mathrm{Lin}(H|M)$, and $M^{\bar H} \neq \bot$.
In the case under consideration, all clauses of Theorem 7.3 for H either hold
trivially (because their antecedents are false) or follow immediately from the IH.
Case C: σ is a $\mathit{Queue}[\mathit{prevIdx}_p^G].\mathrm{F\&S}$ with response r for some r (see line 4 of
isHead()). In this case,
$\bar H = \bar G \circ \langle (\mathrm{INV}, p, M, \mathrm{isHead}()), (\mathrm{RES}, p, M, \mathit{ret}) \rangle$
where ret = true if $r \neq (\mathit{prevIdx}_p^H, -)$ and ret = false otherwise. Furthermore,
$p \in \mathrm{QProcs}(M^{\bar G})$ and $p \notin \mathrm{VisProcs}(M^{\bar G})$ since H does not contain a bad operation
execution. To show that $\bar H \in \mathrm{Lin}(H|M)$ we must show that ret = true iff
$p = \mathrm{head}(M^{\bar H})$. Let $j = \mathit{prevIdx}_p^G$ and consider the following subcases.
Subcase C1: Some $q \in \mathcal{P}$ owns j at the end of G. Then $p = \mathrm{succ}(M^{\bar G}, q)$ by
Lemma 7.2 (3), and in particular $q \in \mathrm{QProcs}(M^{\bar G})$ by definition of succ. Furthermore
$\mathit{Queue}[j]^G = (j, q)$ by the IH for G since $p \notin \mathrm{VisProcs}(M^{\bar G})$. Thus, r = (j, q)
and so ret = false, while $p \neq \mathrm{head}(M^{\bar G})$, hence $p \neq \mathrm{head}(M^{\bar H})$, as wanted.
To prove the theorem for H in the case under consideration, it suffices to verify
that β(H, j) and $\beta(H, \mathit{myIdx}_q^H)$ hold; all other clauses of Theorem 7.3 for H either
hold trivially or follow immediately from the IH. (Note that in this case j is the only
position of Queue that is changed by σ, and q is the only process whose successor,
namely p, becomes visible as a result of step σ; the linearized state of M is otherwise
unchanged.) It follows by line 2 of the algorithm that $\mathit{myIdx}_q^H = \mathit{prevIdx}_p^H$, and since
$\mathit{prevIdx}_p^H = \mathit{prevIdx}_p^G$ that $\mathit{prevIdx}_p^H = j$. Thus, the conditions $\beta(H, \mathit{myIdx}_q^H)$ and
β(H, j) are equivalent. Furthermore, since $p \in \mathrm{VisProcs}(M^{\bar H})$ by the action of step
σ, $\beta(H, \mathit{myIdx}_q^H)$ asserts that $\mathit{Queue}[\mathit{myIdx}_q^H]^H = (\mathit{myIdx}_p^H, p)$. Indeed we have
$$\begin{aligned}
\mathit{Queue}[\mathit{myIdx}_q^H]^H &= \mathit{Queue}[\mathit{prevIdx}_p^H]^H && \text{because } \mathit{myIdx}_q^H = \mathit{prevIdx}_p^H \\
&= \mathit{Queue}[\mathit{prevIdx}_p^G]^H && \text{because } \mathit{prevIdx}_p^H = \mathit{prevIdx}_p^G \\
&= (\mathit{myIdx}_p^G, p) && \text{by the action of the last operation execution in } H \\
&= (\mathit{myIdx}_p^H, p) && \text{because } \mathit{myIdx}_p^H = \mathit{myIdx}_p^G
\end{aligned}$$
Subcase C2: No process owns j at the end of G. It follows that $p = \mathrm{head}(M^{\bar G})$,
otherwise by line 2 and the fact that H contains no bad operation executions it
would be the case that $\mathit{myIdx}^G_{\mathrm{pred}(M^{\bar G}, p)} = j$. Furthermore $\mathit{Queue}[j]^G \neq (j, -)$ by
the IH for G. Thus, $r \neq (j, -)$ and so ret = true, while $p = \mathrm{head}(M^{\bar G})$, as wanted.
To prove the theorem for H in the case under consideration it suffices to verify
that β(H, j) holds; all other clauses of Theorem 7.3 for H either hold trivially or
follow immediately from the IH. (Note that in this subcase the fact that p becomes
visible as a result of step σ does not affect the value of Queue at the position owned
by p’s predecessor, since $p = \mathrm{head}(M^{\bar G})$ and so p has no predecessor.) By the
hypothesis of Subcase C2, no process owns j at the end of G, hence no process owns
j at the end of H. Thus, β(H, j) asserts that $\mathit{Queue}[j]^H \neq (j, -)$. Indeed we have
$$\begin{aligned}
\mathit{Queue}[j]^H &= (\mathit{myIdx}_p^G, p) && \text{by the action of } \sigma \\
&\neq (\mathit{prevIdx}_p^G, -) && \text{by Lemma 7.2 (4) for } G \\
&= (j, -)
\end{aligned}$$
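Case D: $\sigma$ is a $\mathit{Queue}[\mathit{myIdx}_p^G].\mathrm{F\&S}$ with response
$(\mathit{myIdx}_p^G, -)$ (see line 6 of dequeue()). In this case,
\[
\bar{H} = \bar{G} \circ \langle (\mathrm{INV}, p, M, \mathit{dequeue}()),\ (\mathrm{RES}, p, M, -1) \rangle .
\]
Since, by assumption, $H$ contains no bad operation executions,
$p \in \mathit{QProcs}(M^{\bar{G}})$, $p \in \mathit{VisProcs}(M^{\bar{G}})$ and
$p = \mathit{head}(M^{\bar{G}})$. By the IH, $M^{\bar{G}} \neq \bot$ and so
$M^{\bar{H}} \neq \bot$. By the case under consideration,
$\mathit{Queue}[\mathit{myIdx}_p^G]^G = (\mathit{myIdx}_p^G, -)$. By the IH,
$\beta(G, \mathit{myIdx}_p^G)$ holds, which implies along with Lemma 7.2 (1) that
$\mathit{succ}(M^{\bar{G}}, p) \notin \mathit{VisProcs}(M^{\bar{G}})$. By the specification of
MutexQueue the response of a dequeue() operation by $p$ applied to $M^{\bar{G}}$ is $-1$.
Thus, $\bar{H} \in \mathit{Lin}(H|M)$.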
To prove the theorem for $H$ in the case under consideration, it suffices to verify
that $\beta(H, \mathit{myIdx}_p^G)$ and $\beta(H, \mathit{myIdx}_p^H)$ both hold; all other
clauses for $H$ either hold trivially or follow immediately from the IH. (Recall our
convention regarding when assignments to private variables take effect; in particular,
in this case, the assignment to $\mathit{myIdx}_p$ at line 10 takes effect atomically with
step $\sigma$. Thus, step $\sigma$ causes $p$ to relinquish ownership of $\mathit{myIdx}_p^G$
and acquire ownership of $\mathit{myIdx}_p^H$. Also, recall that $p$ has no predecessor in
$M^{\bar{G}}$, and so $p$ no longer being visible has no impact on the meaning of
$\beta(H, i)$ for $i \neq \mathit{myIdx}_p^G, \mathit{myIdx}_p^H$.)
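First consider $\beta(H, \mathit{myIdx}_p^G)$. Notice that no process owns
$\mathit{myIdx}_p^G$ at the end of $H$. This is because only $p$ owns $\mathit{myIdx}_p^G$
at the end of $G$ (by Lemma 7.2 (1)), and at the end of $H$ $p$ owns a different index,
namely $\mathit{myIdx}_p^H$ (note that $\mathit{myIdx}_p^H = \mathit{prevIdx}_p^G$ by line 10,
and $\mathit{prevIdx}_p^G \neq \mathit{myIdx}_p^G$ by Lemma 7.2 (4)). Thus,
$\beta(H, \mathit{myIdx}_p^G)$ asserts that
$\mathit{Queue}[\mathit{myIdx}_p^G]^H \neq (\mathit{myIdx}_p^G, -)$. This is indeed true since
\begin{align*}
\mathit{Queue}[\mathit{myIdx}_p^G]^H
  &= (\mathit{prevIdx}_p^G, -) && \text{by the action of step } \sigma \\
  &\neq (\mathit{myIdx}_p^G, -) && \text{by Lemma 7.2 (4) for } G
\end{align*}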
Next, consider $\beta(H, \mathit{myIdx}_p^H)$. By the hypothesis of Case D, in $H$ $p$ has
applied $\mathit{Queue}[\mathit{myIdx}_p].\mathrm{F\&S}$ at line 6 since it last applied
$\mathit{Queue}[\mathit{myIdx}_p].\mathrm{write}$ at line 1. Thus,
$\beta(H, \mathit{myIdx}_p^H)$ asserts that
$\mathit{Queue}[\mathit{myIdx}_p^H]^H \neq (\mathit{myIdx}_p^H, -)$. We now prove
that this is the case. We have that $\mathit{myIdx}_p^H = \mathit{prevIdx}_p^G$ (by line 10).
Also, no process owns $\mathit{prevIdx}_p^G$ at the end of $G$: $p$ does not own it because it
owns $\mathit{myIdx}_p^G$ and $\mathit{myIdx}_p^G \neq \mathit{prevIdx}_p^G$ (by Lemma 7.2 (4));
and no other process owns it because if one, say $z$, did then $p$ and $z$ would both own it
at the end of $H$, contradicting Lemma 7.2 (1). Since no process owns $\mathit{prevIdx}_p^G$
at the end of $G$, by $\beta(G, \mathit{prevIdx}_p^G)$, which holds by the IH, we have
$\mathit{Queue}[\mathit{prevIdx}_p^G]^G \neq (\mathit{prevIdx}_p^G, -)$. But then, since
$\mathit{prevIdx}_p^G = \mathit{myIdx}_p^H$, and since step $\sigma$ does not change position
$\mathit{prevIdx}_p^G$ of $\mathit{Queue}$ (it changes position $\mathit{myIdx}_p^G$, which is
different from $\mathit{prevIdx}_p^G$ by Lemma 7.2 (4)), we have
$\mathit{Queue}[\mathit{myIdx}_p^H]^H \neq (\mathit{myIdx}_p^H, -)$, as wanted.
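Case E: $\sigma$ is a $\mathit{Queue}[\mathit{myIdx}_p^G].\mathrm{F\&S}$ with response different
from $(\mathit{myIdx}_p^G, -)$ (see line 6 of dequeue()). In this case,
\[
\bar{H} = \bar{G} \circ \langle (\mathrm{INV}, p, M, \mathit{dequeue}()),\ (\mathrm{RES}, p, M, r) \rangle
\]
where $r = \mathit{tempId}_p^H$. Since, by assumption, $H$ contains no bad operation executions,
$p \in \mathit{QProcs}(M^{\bar{G}})$, $p \in \mathit{VisProcs}(M^{\bar{G}})$ and
$p = \mathit{head}(M^{\bar{G}})$. By the IH, $M^{\bar{G}} \neq \bot$ and so
$M^{\bar{H}} \neq \bot$. By the case under consideration,
$\mathit{Queue}[\mathit{myIdx}_p^G]^G \neq (\mathit{myIdx}_p^G, -)$. By the IH,
$\beta(G, \mathit{myIdx}_p^G)$ holds, which implies that
$\mathit{succ}(M^{\bar{G}}, p) \in \mathit{VisProcs}(M^{\bar{G}})$ and moreover that
$r = \mathit{tempId}_p^H = \mathit{succ}(M^{\bar{G}}, p)$. By the specification of MutexQueue
the response of a dequeue() operation execution by $p$ applied to $M^{\bar{G}}$ is
$\mathit{succ}(M^{\bar{G}}, p)$. Since $r = \mathit{succ}(M^{\bar{G}}, p)$,
$\bar{H} \in \mathit{Lin}(H|M)$.
To prove the theorem for $H$ in the case under consideration, we proceed as in
Case D.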
Case F: $H$ ends with an event on the target object $M$ by process $p$. The proof is
analogous to the one given in Case H in the proof of Theorem 6.3 for
Implementation MQFI.
Case G: $H$ contains a bad MutexQueue operation. The proof is analogous to the
one given in Case I in the proof of Theorem 6.3 for Implementation MQFI.
7.1.1 RMR Complexity
Each access procedure of Implementation MQFS incurs O(1) steps since there are
no loops. Since every RMR is incurred by some step, the RMR complexity of each
access procedure is also O(1).
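The loop-free structure of the access procedures can be illustrated with a small
sequential simulation. The sketch below is a reconstruction inferred only from the line
references used in the case analysis of Theorem 7.3 (a write at line 1 of enqueue(), a
fetch-and-store on Queue[prevIdx] in isHead(), a fetch-and-store at line 6 and the
assignment of prevIdx to myIdx at line 10 of dequeue()); it is not the pseudocode of
Implementation MQFS itself. The names MutexQueueSketch and fetch_and_store, the tail
register swapped at "line 2", the use of None for the "−" component, and the initial
array contents are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, Optional[int]]  # (index, process id), None stands for "-"


def fetch_and_store(array: list, i: int, new):
    """Sequentially simulated F&S: store `new` in array[i], return the old value."""
    old = array[i]
    array[i] = new
    return old


class MutexQueueSketch:
    """Hypothetical reconstruction of a Craig-style MutexQueue (assumption)."""

    def __init__(self, n: int):
        # n slots initially owned by processes 0..n-1, plus one unowned slot n.
        # The unowned slot must hold a value different from (n, None).
        self.queue: List[Cell] = [(i, None) for i in range(n)] + [(-1, None)]
        self.tail = [n]                      # assumed tail register (one cell)
        self.my_idx = list(range(n))         # private variable myIdx_p
        self.prev_idx: List[Optional[int]] = [None] * n  # private prevIdx_p

    def enqueue(self, p: int) -> None:
        i = self.my_idx[p]
        self.queue[i] = (i, None)                            # "line 1": write
        self.prev_idx[p] = fetch_and_store(self.tail, 0, i)  # "line 2" (assumed)

    def is_head(self, p: int) -> bool:
        j = self.prev_idx[p]
        # Announce p in the predecessor's slot; p becomes visible.
        r = fetch_and_store(self.queue, j, (self.my_idx[p], p))
        # If slot j still held (j, -), its owner has not dequeued yet.
        return r != (j, None)

    def dequeue(self, p: int) -> int:
        i = self.my_idx[p]
        r = fetch_and_store(self.queue, i, (self.prev_idx[p], None))  # "line 6"
        self.my_idx[p] = self.prev_idx[p]                             # "line 10"
        if r == (i, None):
            return -1      # Case D: successor absent or not yet visible
        return r[1]        # Case E: identifier of the visible successor
```

For two processes, the sequence enqueue(0), isHead(0), enqueue(1), isHead(1),
dequeue(0), dequeue(1) exercises both cases of the analysis: dequeue(0) returns the
visible successor 1 (Case E), and dequeue(1) finds its own slot unchanged and returns
−1 (Case D). Every procedure performs a constant number of shared-memory accesses.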
8 Conclusion
In this paper we have shown how to solve mutual exclusion for N processes using
a linearizable implementation of an N-process MutexQueue object and atomic
read/write registers. In doing so we have re-cast the problem of implementing and
proving correct an O(1)-RMR (per-passage) queue-based mutual exclusion algorithm
into the intuitively more fundamental problem of implementing the underlying queue
using O(1) RMRs per operation. We have presented and proved correct two such
implementations of MutexQueue, based on the mutual exclusion algorithms of T.
Anderson and Craig [4, 8]. We believe that a MutexQueue implementation can also
be extracted from Rhee’s algorithm [24], from the two algorithms of Lee [19], as well
as from the algorithm of Mellor-Crummey and Scott [23].7
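The reduction from mutual exclusion to MutexQueue can be sketched as follows. This is
a minimal illustration in the spirit of the generic algorithm, not a transcription of
Algorithm GQME: the class names, the demo function, and the use of threading.Event in
place of spinning on read/write registers are assumptions. The CoarseMutexQueue
stand-in obtains atomicity from an internal lock purely so that the example runs; an
actual instantiation would use an O(1)-RMR implementation such as the two presented
in this paper.

```python
import threading
from collections import deque


class CoarseMutexQueue:
    """Stand-in MutexQueue (assumption): a deque plus a visibility set."""

    def __init__(self):
        self._guard = threading.Lock()
        self._q = deque()
        self._visible = set()

    def enqueue(self, p):
        with self._guard:
            self._q.append(p)

    def is_head(self, p):
        with self._guard:
            self._visible.add(p)            # p becomes visible
            return self._q[0] == p

    def dequeue(self, p):
        with self._guard:
            assert self._q[0] == p          # only the head may dequeue
            self._q.popleft()
            self._visible.discard(p)
            if self._q and self._q[0] in self._visible:
                return self._q[0]           # visible successor
            return -1                       # no successor, or not yet visible


class QueueLock:
    """Lock for processes 0..n-1 built from a MutexQueue and one flag each."""

    def __init__(self, n):
        self.mq = CoarseMutexQueue()
        self.go = [threading.Event() for _ in range(n)]

    def acquire(self, p):
        self.mq.enqueue(p)
        if not self.mq.is_head(p):
            self.go[p].wait()               # wait for the predecessor's signal
            self.go[p].clear()

    def release(self, p):
        s = self.mq.dequeue(p)
        if s != -1:
            self.go[s].set()                # hand over to the visible successor


def demo(n=4, iters=200):
    lock = QueueLock(n)
    counter = [0]

    def worker(p):
        for _ in range(iters):
            lock.acquire(p)
            v = counter[0]                  # deliberately non-atomic update
            counter[0] = v + 1
            lock.release(p)

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

If the exiting process finds no visible successor, dequeue() returns −1 and no signal
is sent; a successor that calls is_head() afterwards then observes that it is the head
and proceeds without waiting. With mutual exclusion in place, demo(n, iters) returns
n * iters.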
It is interesting to note that the above algorithms are precisely those that achieve
O(1) RMR complexity in both the CC and DSM models. Algorithms that are limited
to the CC model [12, 22] tend to have a simpler structure, intuitively by taking
advantage of the fact that any process can locally spin on any variable. This makes
it possible for processes, in particular predecessor-successor pairs, to communicate
without knowing each other’s names. In particular, in the entry protocol a process
can enter the queue and signal its predecessor by applying a single atomic operation;
in contrast, MutexQueue contains distinct operations corresponding to these two
tasks. Also, in the CC model a process in the exit protocol can wake up its successor
without knowing the successor’s identity. Thus, queue-based local-spin algorithms
specific to the CC model operate in a mode significantly different from the one
captured by the MutexQueue data type.

Footnote 7: The MCS algorithm lacks the bounded exit property, and so the corresponding
implementation of MutexQueue is not wait-free due to the presence of a busy-wait loop in
the access procedure for the dequeue() operation. In particular, termination of dequeue()
is only guaranteed if every execution of enqueue() is eventually followed by an execution
of isHead() by the same process. This condition is certainly satisfied by Algorithm GQME
from Section 5.
Acknowledgments
I am deeply indebted to Vassos Hadzilacos for his thorough readings of and construc-
tive feedback on multiple earlier drafts of this paper, in particular his contributions
of the informal description of Craig’s algorithm at the beginning of Section 7, and
extensive help in wording the proofs of correctness in Sections 5, 6 and 7. I would
also like to thank Prasad Jayanti for enlightening discussions about the mutual ex-
clusion problem, and in particular for sharing the version of Craig’s mutual exclusion
algorithm on which Implementation MQFS of Section 7 is based.
References
[1] R. Alur and G. Taubenfeld. Results about fast mutual exclusion. In Proc. of the 13th
IEEE Real-Time Systems Symposium, pages 12–21, December 1992.
[2] J. Anderson and Y.-J. Kim. A generic local-spin fetch-and-phi-based mutual exclusion
algorithm. Journal of Parallel and Distributed Computing, 67(5):551–580, May 2007.
[3] J. Anderson, Y.-J. Kim, and T. Herman. Shared-memory mutual exclusion: Major
research trends since 1986. Distributed Computing, 16(2-3):75–110, 2003.
[4] T. Anderson. The performance of spin lock alternatives for shared-memory multipro-
cessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6–16, Jan. 1990.
[5] H. Attiya, D. Hendler, and P. Woelfel. Tight RMR lower bounds for mutual exclu-
sion and other problems. In Proc. of the 40th Annual ACM Symposium on Theory of
Computing, pages 217–226, 2008.
[6] M. Choy and A. Singh. Adaptive solutions to the mutual exclusion problem. Distributed
Computing, 8(1):1–17, 1994.
[7] T. Craig. Building FIFO and priority-queuing spin locks from atomic swap. Technical
Report TR-93-02-02, University of Washington, Seattle, WA, 1993.
[8] T. Craig. Queuing spin lock algorithms to support timing predictability. In Proc. 14th
IEEE Real-time Systems Symposium, pages 148–156, Dec. 1993.
[9] E. W. Dijkstra. Solution of a problem in concurrent programming control.
Communications of the ACM, 8(9):569, Sept. 1965.
[10] C. Dwork, M. Herlihy, and O. Waarts. Contention in shared memory algorithms. J.
ACM, 44(6):779–805, 1997.
[11] W. Golab, V. Hadzilacos, D. Hendler, and P. Woelfel. RMR-efficient implementations
of comparison primitives using read and write operations. Distributed Computing, pages
1–54, 2011.
[12] G. Graunke and S. Thakkar. Synchronization algorithms for shared-memory multipro-
cessors. IEEE Computer, 23:60–69, June 1990.
[13] M. Herlihy. Wait-free synchronization. ACM TOPLAS, 13(1):124–149, Jan. 1991.
[14] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent
objects. ACM TOPLAS, 12(3):463–492, July 1990.
[15] P. Jayanti. A time complexity lower bound for randomized implementations of some
shared objects. In Proc. of the 17th annual ACM symposium on Principles of distributed
computing, pages 201–210, 1998.
[16] Y.-J. Kim and J. Anderson. A space- and time-efficient local-spin spin lock. Information
Processing Letters, 84(1):47–55, Sept. 2002.
[17] L. Lamport. The mutual exclusion problem: part I – a theory of interprocess commu-
nication. J. ACM, 33(2):313–326, 1986.
[18] L. Lamport. The mutual exclusion problem: part II – statement and solutions. J.
ACM, 33(2):327–348, 1986.
[19] H. Lee. Local-spin mutual exclusion algorithms on the DSM model using fetch&store
objects. Master’s thesis, University of Toronto, 2003.
[20] H. Lee. Transformations of mutual exclusion algorithms from the cache-coherent model
to the distributed shared memory model. In Proc. of the 25th IEEE International
Conference on Distributed Computing Systems (ICDCS’05), pages 261–270, 2005.
[21] N. Lynch and M. Tuttle. An introduction to input/output automata. CWI-Quarterly,
2(3):219–246, 1989.
[22] P. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multipro-
cessors. In Proc. 8th International Symposium on Parallel Processing, pages 165–171,
Apr. 1994.
[23] J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on shared-
memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, 1991.
[24] I. Rhee. Optimizing a FIFO, scalable spin lock using consistent memories. In Proc. of
the 17th IEEE Real-Time Systems Symposium, pages 106–114, December 1996.
[25] J.-H. Yang and J. Anderson. A fast, scalable mutual exclusion algorithm. Distributed
Computing, 9(1):51–60, Aug. 1995.
