We formulate a modular approach to the design and analysis of a particular class of mutual exclusion algorithms for shared memory multiprocessor systems. Specifically, we consider algorithms that organize waiting processes into a queue. Such algorithms can achieve O(1) remote memory reference (RMR) complexity, which minimizes (asymptotically) the amount of traffic through the processor-memory interconnect. We first describe a generic mutual exclusion algorithm that relies on a linearizable implementation of a particular queue-like data structure that we call MutexQueue. Next, we show two implementations of MutexQueue using O(1) RMRs per operation based on synchronization primitives commonly available in multiprocessors. These implementations follow closely the queuing code embedded in previously published mutual exclusion algorithms. We provide rigorous correctness proofs and RMR complexity analyses of the algorithms we present.
Introduction
Synchronization is a fundamental challenge in asynchronous shared memory multiprocessor systems, where processes executing in parallel must exercise caution while accessing shared data structures. Unless concurrent access to such data structures is directly supported in hardware, careful coordination is necessary at the software level to prevent corruption of the data structure and ensure that processes executing on different processors reach consistent views of the data. The dominant approaches to such coordination are mutual exclusion and non-blocking synchronization.
Mutual exclusion (ME) was formulated by Dijkstra [9], and later formalized by Lamport [17, 18]. In this approach, processes take turns accessing the shared data structure. The execution path of each process is modelled as a repeating sequence of four sections, illustrated below in Figure 1. Access to the shared data structure is confined to a special critical section (CS), to which the process must gain exclusive access by executing the entry protocol. Similarly, an exit protocol is executed upon completing the CS to signal that the CS can now be entered by another process. Between executions of the CS, a process lives in the non-critical section (NCS). The set of variables accessed by a process while in the CS or the NCS is disjoint from the set of variables accessed while in the entry or exit protocol.
An execution by a process of the entry protocol, CS and exit protocol is referred to as a passage (through the ME algorithm). The entry section is sometimes divided into a doorway, where a process enters a queue by executing a bounded number of steps, and a waiting room, where it waits for its predecessor in the queue to exit the CS. This type of algorithm is referred to as first-come first-served (FCFS). Lockout freedom (also referred to as starvation freedom) is a progress property whereby every process that begins the entry protocol eventually enters the CS, provided that no process halts outside the NCS. Finally, deadlock freedom is a weaker property that guarantees the same but only provided that no process passes through the CS infinitely often, and is considered the weakest progress property required for correctness of a mutual exclusion algorithm.
In contrast to mutual exclusion, non-blocking synchronization requires some measure of progress regardless of the rates at which other processes are executing. For example, in wait-free synchronization [13], a process must complete each access to the shared data structure in a finite number of its own steps. The key idea behind universal constructions of wait-free data structures is that faster processes assist slower ones in performing updates. To that end, processes exploit hardware synchronization primitives to agree on the order in which updates are applied, and hence on the state of the shared data.
Mutual exclusion and wait-freedom have complementary characteristics. ME is a blocking approach since a fast process can spend an unbounded amount of time in a busy-wait loop, which typically involves repeatedly testing or spinning on one variable, while waiting for another process to complete the CS and then write that variable. In contrast, in a wait-free algorithm a process ensures its own progress even if all others halt at an arbitrary point in their execution. In that sense, wait-freedom is a stronger progress property than the one underlying mutual exclusion. Not surprisingly, there exist shared object types for which wait-free implementations are provably more costly than blocking ones. For example, any wait-free N-process implementation of fetch-and-increment using atomic read/write registers (subsequently referred to simply as registers) and fetch-and-store requires Ω(log N) remote memory references (RMR, discussed below) [15]. In contrast, there is a blocking implementation of fetch-and-increment from registers and fetch-and-store using only O(1) RMRs. In light of its lower cost, mutual exclusion remains the dominant approach to synchronization in practice.
Time Complexity of Mutual Exclusion Algorithms
Analyzing the time complexity of ME algorithms requires some awareness of the shared memory hardware architecture as different memory operations may incur significantly different latencies. This is due to the growing disparity between processor speed and memory access speed, which motivates multiprocessor designs based on the paradigms of non-uniform memory access (NUMA) and/or caching. Two important classes of such architectures are illustrated in Figure 2 [23, 3] . In a Distributed Shared Memory (DSM) machine, each memory module can be accessed locally by some processor without involving the processor-to-memory interconnect, thus reducing much of the latency. Processors in cache-coherent (CC) machines, on the other hand, maintain local copies of data inside caches, which are synchronized by a coherence protocol. Thus, any shared memory location can become local at runtime to any processor in the CC model.
In both the DSM and CC models, memory operations are classified as either remote or local. This classification is straightforward in the DSM model as locality is determined through static allocation of a variable in a particular memory module. In contrast, in the CC model locality of a memory operation is determined by the state of the processor's cache, which depends on prior steps of the same processor and possibly others, as well as by the type of memory access (e.g., read versus write), which determines the behaviour of the coherence protocol. For our purposes, we consider the following ideal behaviour in the CC model: after a processor reads a variable, this variable is held in that processor's cache and can be read locally (i.e., without incurring an RMR) until another processor writes the same variable.
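The ideal CC behaviour described above can be illustrated with a small accounting sketch. The class name IdealCCMemory and its methods are hypothetical names introduced here. The treatment of writes (each write is charged one RMR, invalidates every other processor's copy, and does not by itself give the writer a cached copy) is an assumption of this sketch, since the text only fixes the behaviour of reads.

```python
class IdealCCMemory:
    """Counts RMRs under an idealized cache-coherent model (illustrative only)."""

    def __init__(self):
        self.values = {}      # variable name -> current value
        self.cached_by = {}   # variable name -> processors holding a valid copy
        self.rmrs = {}        # processor id -> number of RMRs charged so far

    def _charge(self, proc):
        self.rmrs[proc] = self.rmrs.get(proc, 0) + 1

    def read(self, proc, var):
        holders = self.cached_by.setdefault(var, set())
        if proc not in holders:
            # Cache miss: the read goes over the interconnect, then is cached.
            self._charge(proc)
            holders.add(proc)
        return self.values.get(var)

    def write(self, proc, var, value):
        # Assumption of this sketch: every write costs one RMR and invalidates
        # all copies held by other processors; the writer keeps a copy only
        # if it already had one.
        self._charge(proc)
        self.values[var] = value
        holders = self.cached_by.get(var, set())
        self.cached_by[var] = holders & {proc}


mem = IdealCCMemory()
for _ in range(5):
    mem.read(0, "flag")        # processor 0 spins on "flag"
mem.write(1, "flag", True)     # processor 1 signals processor 0
mem.read(0, "flag")            # first read after the write misses again
```

In this run, the five spinning reads by processor 0 cost a single RMR, and the read that follows the write by processor 1 costs one more. This is the reason spinning on a cached variable counts as local in the CC model.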
In both the DSM and CC models, we will assume for worst-case analysis that each process runs on a distinct processor. Remote operations, referred to as remote memory references or RMRs, can be orders of magnitude more costly than local ones. Consequently, RMR complexity quantifies not only the overhead of accessing the processor-to-memory interconnect, but also the main source of latency incurred while executing a mutual exclusion algorithm. Mutual exclusion algorithms with bounded RMR complexity are referred to as local-spin, and have been the focus of recent research [3] on shared memory multiprocessors. In such an algorithm, all busy-waiting must be done by spinning on locally accessible variables.
To obtain a more direct measure of time complexity, one can consider the overhead of contention (for the processor-to-memory interconnect and shared memory modules) in addition to RMRs. This overhead can be quantified by counting memory stalls under the assumption that concurrent accesses to a common shared variable are serialized [10] . In that case, the i'th process in the serialization order incurs i − 1 memory stalls as it waits for its predecessors to complete their operations. The exact overhead of contention depends on the shared memory architecture. Most notably, in a bus-based system a snooping protocol makes it possible for multiple processes to read a common shared variable simultaneously. In that case one counts memory stalls for concurrent writes but not for concurrent reads.
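As a small worked instance of the stall-counting rule above, suppose k processes access a common shared variable concurrently and all k accesses are serialized (the symbol k is introduced here for illustration only). Summing the stalls of the individual processes gives:

```latex
\[
  \sum_{i=1}^{k} (i - 1) \;=\; \frac{k(k-1)}{2}
\]
```

For k = 4 this is 0 + 1 + 2 + 3 = 6 memory stalls in total. Under the bus-based variant mentioned above, only concurrent writes would be counted in this sum.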
Time complexity measures for mutual exclusion algorithms typically omit local memory operations. Although local operations do have an impact on the overall latency, such a complexity measure is generally unbounded. Even if the CS is empty, due to the assumption of asynchrony there is no bound on the time that a process leaving the CS takes to execute the exit protocol and allow the next process to proceed into the CS. Furthermore, in an algorithm using only registers, even the first process to enter the CS may perform an unbounded number of steps unless it is executing solo [1]. It is possible to circumvent this problem by defining time complexity in terms of a virtual clock that ticks once for every interval of time in which every process has been given sufficient time to perform one operation on a shared memory object. The response time of a mutual exclusion algorithm is the number of such clock ticks from the time a process leaves the NCS to the time it enters the CS [6].
Contributions of This Paper
Consider the following simple and intuitively appealing idea for an FCFS mutual exclusion algorithm: Processes wait in a queue to enter the CS. Only the head of the queue may enter the CS. A process leaving the NCS adds itself to the end of the queue and, if it is not the head of the queue, it waits by repeatedly reading a local spin variable. A process leaving the CS removes itself from the (head of the) queue. It then writes a shared variable to signal its successor (now the new head of the queue), perhaps after checking if such a process exists, to stop waiting and proceed into the CS.
Clearly, race conditions can arise when a process contending for entry to the CS checks whether it is the head of the queue (perhaps as it does so another process is about to enter the queue), and when a process leaving the CS checks whether there is a successor in the queue (perhaps as it does so another process is about to become its successor). Handling these race conditions properly, while relying on standard synchronization primitives and using as few RMRs as possible, is a delicate task.
Several algorithms based on the above idea have appeared in the literature [4, 12, 23, 8, 22, 24, 19]. The common simple structure underlying all these algorithms, however, is obscured by the intricate details of handling the race conditions described above. Furthermore, to our knowledge, some of these algorithms have not been proved correct.
In this paper we propose a modular approach to the design and analysis of such algorithms. We first define a queue-like shared data structure, called MutexQueue. This data structure allows a process to add itself to the end of the queue, query whether it is the head of the queue, and remove itself from the head of the queue (simultaneously determining the identity of its successor in the queue, if one exists). We then present a very simple generic queue-based mutual exclusion algorithm along the lines described above, that uses this data structure as a "black box". We prove the correctness of this algorithm based on the abstract properties of MutexQueue. This algorithm uses only a constant number of RMRs, beyond what are needed to implement the "black box" MutexQueue, and applies only a constant number of operations on MutexQueue, per passage.
We then present two implementations of MutexQueue, both using only a constant number of RMRs for each operation in the DSM and CC models. The first uses registers and the fetch-and-increment primitive (which atomically increments a shared memory word and returns its previous value) while the second uses registers and the fetch-and-store primitive (which atomically assigns a new value to a shared memory word and returns its previous value).
The two implementations of MutexQueue are not novel: they are embedded in previously published mutual exclusion algorithms; here, we have simply recast them as implementations of the MutexQueue data structure. Specifically, the first implementation of MutexQueue is based on a mutual exclusion algorithm due to Tom Anderson [4], as subsequently modified by James Anderson and Yong-Jik Kim. The second implementation of MutexQueue is based on a mutual exclusion algorithm due to Craig [8]. To our knowledge, however, these mutual exclusion algorithms have not been proved correct.¹ In this paper we give rigorous correctness proofs of these algorithms (as implementations of MutexQueue).
The advantage of our modular approach is that it "factorizes" the common structure of some queue-based algorithms, in the form of the generic mutual exclusion algorithm. The correctness of this common part need only be proved once. What is left in each of these algorithms, can be viewed as an implementation of the MutexQueue data structure.
Our definition of the MutexQueue also sheds light on how exactly processes coordinate access to the critical section in queue-based mutual exclusion algorithms. For example, in such algorithms a process does not enter the queue and also discover whether it became the head element in one atomic step. Rather, two atomic steps are required, and are therefore represented by distinct MutexQueue operations. In contrast, a process can exit the queue and discover its successor in one atomic step. Surprisingly, sometimes a process can also exit the queue and discover no other process, even though a successor does exist! In particular, this occurs if the successor has entered the queue but has not yet queried the head element. Thus, it is the latter step (i.e., querying the head) that makes a process "visible" to its predecessor, and not the mere act of entering the queue.
Related Work
The RMR complexity of mutual exclusion algorithms is a function of the number of processes, N. The best known upper bound on the worst-case RMR complexity per passage of algorithms based on (atomic) read/write registers only is O(log N) [25, 16]. This bound is tight [5]. The same tight bound holds for the class of mutual exclusion algorithms that in addition to registers use compare-and-swap (CAS) or load-linked/store-conditional (LL/SC), primitives that conditionally change the value of a shared memory location [11].
Using synchronization primitives such as fetch-and-store (i.e., swap between shared memory and a private register) and fetch-and-increment, it is possible to devise mutual exclusion algorithms with worst-case RMR complexity of only O(1) [4, 12, 23, 8, 22, 24, 19].² The properties of these algorithms are summarized in Table 1. All of these algorithms are based on the concept of a process queue, which determines the order in which processes enter the CS and enables efficient signaling between processes that enter the CS consecutively. Thus, in addition to mutual exclusion and lockout freedom, these algorithms also satisfy FCFS.
¹ A variant of Craig's algorithm [8] is proved correct in [19]. This variant is intuitively simpler, but uses an array of length 2N instead of N + 1 to encode the queue of processes waiting to enter the critical section.
² The original algorithm of T. Anderson [4] uses a constant number of RMRs in the CC model but is not local-spin in the DSM model. A constant-RMR DSM variant using the same synchronization primitives can be obtained by applying the transformation described in [20] or in footnote 7 of [2]. Rhee's algorithm [24] is targeted at a variant of the DSM model with weaker memory consistency, where read/write operations executed by one processor may appear to take effect in a different order to another processor due to buffering of writes. In this model, a special fence operation is used to force previously buffered writes to take effect globally.
Many of the queue-based constant-RMR mutual exclusion algorithms cited above were presented in the context of performance studies, and lack rigorous proofs of correctness. Moreover, to our knowledge the only attempt to generalize or unify these algorithms, all of which are based on the process queue concept, is the generic algorithm of Anderson and Kim [2]. This algorithm solves mutual exclusion using O(1) RMRs per passage given a suitable shared-memory primitive fetch-and-φ, which corresponds to the (atomic) execution of the pseudocode shown in Figure 3.
Figure 3: The fetch-and-φ primitive applied to variable var with argument input, executed atomically: (1) old := var; (2) var := φ(old, input); (3) return old.
The fetch-and-φ primitive can be instantiated to a variety of shared-memory primitives by choosing a suitable function φ. For example, a fetch-and-store corresponds to φ(old, input) ≡ input. Similarly, if we use input to encode a pair of values (a, b), a compare-and-swap corresponds to φ(old, (a, b)) ≡ b if old = a, and old otherwise,
where a and b are the expected and target value of compare-and-swap. Thus, fetch-and-φ generalizes various types of read-modify-write primitives, including conditionals.
Unlike its predecessors, the generic fetch-and-φ algorithm of [2] uses two process queues instead of one, in order to cope with the generic and limited assumptions on the behaviour of the fetch-and-φ primitive. Consequently, an additional mechanism is needed to control access to the critical section, and the algorithm loses the (FCFS) property inherent in earlier single-queue solutions.
Correctness of the generic algorithm depends on a condition on the fetch-and-φ primitive related to its ability to return distinct values over repeated invocations. This condition is formalized in terms of a property of a primitive called rank. Intuitively, the higher the rank, the better the primitive at solving mutual exclusion efficiently with respect to RMR complexity. A rank of 2N or greater is sufficient for the generic algorithm, but it is not known whether rank Ω(N) is necessary for solving mutual exclusion with O(1) RMRs per passage. Examples of primitives that have rank 2N or more include an r-bounded fetch-and-increment (i.e., φ(old, input) = min(r − 1, old + 1)) for r ≥ 2N, which has rank r, and fetch-and-store, which has infinite rank. Compare-and-swap as well as test-and-set can also be modeled as fetch-and-φ primitives, but both have rank only two.
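The instantiations of fetch-and-φ mentioned above can be sketched as follows. The class FetchAndPhi and the helper names are hypothetical, and atomicity is simulated with a lock rather than provided by hardware. The compare-and-swap instance assumes the usual reading in which input encodes the pair (expected, target) and the word changes only when the old value equals the expected one.

```python
import threading


class FetchAndPhi:
    """A shared word with an atomic fetch-and-phi (atomicity simulated by a lock)."""

    def __init__(self, initial, phi):
        self._value = initial
        self._phi = phi
        self._lock = threading.Lock()

    def fetch_and_phi(self, inp=None):
        with self._lock:
            old = self._value                   # read the current value
            self._value = self._phi(old, inp)   # apply phi to obtain the new value
            return old                          # always return the previous value


def phi_store(old, inp):
    # fetch-and-store: the new value is the input, whatever the old value was
    return inp


def phi_cas(old, inp):
    # compare-and-swap: inp is the pair (expected, target)
    expected, target = inp
    return target if old == expected else old


def make_phi_bounded_increment(r):
    # r-bounded fetch-and-increment: counts up and saturates at r - 1
    def phi(old, inp):
        return min(r - 1, old + 1)
    return phi


counter = FetchAndPhi(0, make_phi_bounded_increment(3))
returned = [counter.fetch_and_phi() for _ in range(5)]   # [0, 1, 2, 2, 2]
```

The last line gives some intuition for rank: once the bounded counter saturates, repeated invocations return the same value and can no longer tell processes apart.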
Any mutual exclusion algorithm that uses only compare-and-swap and registers requires Ω(log N ) RMRs [5, 11] . In contrast, there are mutual exclusion algorithms that use only fetch-and-store and registers that require only O(1) RMRs (e.g., [8] ). So, from the point of view of supporting RMR-efficient implementations of mutual exclusion, fetch-and-store is more powerful than compare-and-swap. It is interesting that the opposite is the case from the point of view of supporting wait-free implementations of objects. It is well-known from Herlihy's work that compare-and-swap and registers support wait-free implementation of any object shared by any number of processes, while there are objects shared by only three processes that cannot be implemented wait-free using only fetch-and-store and registers [13] .
Road Map
First, we present the model of computation in Section 4. In Section 5 we present our generic formulation of the queue-based mutual exclusion algorithm, and prove its correctness properties, assuming a suitable implementation of MutexQueue, a novel queue-like data structure. Then, in Sections 6 and 7, we discuss two implementations of MutexQueue, based on the fetch-and-increment and fetch-and-store primitives, respectively. Our implementations closely follow the queuing code embedded in existing queue lock algorithms [4, 8] . We conclude in Section 8 with a discussion of the applicability of our analysis technique.
Model of Computation and Definitions
Our model of computation is based on [14] . A concurrent system models an asynchronous shared memory system where N processes communicate by executing operations on shared objects. Formally, a concurrent system is represented as a triple S = (P, V, H), where P = {0, 1, . . . , N − 1} is a set of process identifiers, V is a set of shared objects, also referred to as variables, and H is a set of execution histories. Each process identifier corresponds to a process, which is a sequential thread of control that invokes operations on objects, one at a time, and receives corresponding responses. An object represents a data structure with a well-defined set of states and set of operations that modify the state and return responses to processes. Processes and objects can be formally modelled as input/output automata [21] , but here we adopt a more informal approach.
Steps
Informally, we think of the behaviour of processes in a concurrent system S = (P, V, H) as a collection of steps. There are two categories of steps: atomic and non-atomic. In an atomic step, a process p ∈ P applies operation op on some object v ∈ V and receives the response ret of this operation. This is denoted by a tuple (ATOM, p, v, op, ret). We use atomic steps to denote operations on atomic objects, such as those provided in hardware. In a non-atomic step, a process p either invokes an operation op on some object v ∈ V, or it receives the response ret of the last operation p invoked on v. The former is called an invocation step, and is represented by a tuple (INV, p, v, op). The latter is called a response step, and is represented by a tuple (RES, p, v, ret). We use non-atomic steps (along with atomic steps) to denote operations on objects that are simulated in software from atomic objects, as explained later.
Execution Histories
An execution history, or history for short, is a sequence of steps. An execution history is generated as processes access objects according to the transition functions of the corresponding automata, which we will describe using pseudocode. The histories we will consider contain either only atomic steps, or a combination of atomic and non-atomic steps where each object is accessed by steps of exactly one category.
We say that H is a history of (or over) object v if every step in H accesses v. A response step e_R = (RES, p, v, −) in H matches the last preceding invocation step e_I = (INV, p, v, −) in H (if one exists).³ An invocation step is pending in H if it is not followed by a matching response step.
An operation execution in a history H ∈ H is either a pair of matching invocation/response steps, or a pending invocation step. We call an operation execution complete in the former case, and pending in the latter. Two operation executions are concurrent in H unless the response of one precedes the invocation of the other in H. We say that H is sequential if it contains no concurrent operation executions, and complete if it contains no pending invocations. The set H is prefix-closed, meaning that if H ∈ H and G is a prefix of H then G ∈ H.
For every history H and set P of process IDs, we denote by H|P the maximal subsequence of H consisting only of steps by processes in P . Similarly, for every history H and set V of objects, we denote by H|V the maximal subsequence of H consisting only of steps on objects in V . For a single process ID p or object v, we use H|p and H|v as shorthands for H| {p} and H| {v}, respectively. A process p is active in a history H if H|p is not empty.
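The projection and pending-invocation definitions above can be made concrete with a short sketch. Steps are encoded as Python tuples in the same shape as in the text, with the kind tag first, then the process, then the object. The function names are hypothetical. The helper for pending invocations assumes that a process has at most one outstanding invocation per object, as in the well-formed histories considered later.

```python
def project_processes(history, procs):
    """H|P: the subsequence of steps taken by processes in procs."""
    return [step for step in history if step[1] in procs]


def project_objects(history, objs):
    """H|V: the subsequence of steps on objects in objs."""
    return [step for step in history if step[2] in objs]


def pending_invocations(history):
    """Invocation steps that are not followed by a matching response step.

    Assumes at most one outstanding invocation per (process, object) pair.
    """
    unmatched = {}   # (process, object) -> index of the open invocation
    for i, step in enumerate(history):
        kind, p, v = step[0], step[1], step[2]
        if kind == "INV":
            unmatched[(p, v)] = i
        elif kind == "RES":
            unmatched.pop((p, v), None)   # the response closes the invocation
    return [history[i] for i in sorted(unmatched.values())]


def is_active(history, p):
    """A process is active in H if H|p is not empty."""
    return len(project_processes(history, {p})) > 0


H = [
    ("INV", 0, "T", "enqueue()"),
    ("ATOM", 0, "x", "read", 0),
    ("INV", 1, "T", "enqueue()"),
    ("RES", 0, "T", "OK"),
]
still_pending = pending_invocations(H)   # only the invocation by process 1
```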
Object Types and Conformity to a Type
Every object has a type τ = (P, S, s_init, O, R, δ) where P is a set of process IDs (defined as for concurrent systems), S is a set of states, s_init ∈ S is the initial state, O is a set of operations, R is the set of operation responses, and δ : P × S × O → S × R is a (one-to-many) state transition mapping. The transition mapping δ is intended to capture the behaviour of objects of type τ, in the absence of concurrency, as follows: if a process p applies operation op to an object of type τ that is in state s, then the object may return to p the response r and change its state to s′ if and only if (s′, r) ∈ δ(p, s, op). A complete, sequential execution history H of object v of type τ induces a sequence of tuples (p_i, op_i, r_i) such that in the i'th atomic step or operation execution in H (depending on the structure of H), process p_i applies operation op_i and receives response r_i. We say that v conforms to τ in H if there exists a sequence s_0, s_1, s_2, . . . of states of τ such that s_0 = s_init and for each i ≥ 1, (s_i, r_i) ∈ δ(p_i, s_{i−1}, op_i).
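Conformity to a type can be checked mechanically for a finite sequence of tuples (p_i, op_i, r_i). In the minimal sketch below, the function name conforms and the counter type are illustrative assumptions. The mapping δ is represented as a Python function returning the set of permitted (next state, response) pairs, and the check tracks every state the object could be in, so that a one-to-many δ is handled.

```python
def conforms(delta, s_init, ops):
    """Return True if some state sequence starting at s_init explains ops.

    ops   : iterable of (process, operation, response) tuples
    delta : function (process, state, operation) -> set of (next_state, response)
    """
    possible = {s_init}                  # states the object might currently be in
    for p, op, r in ops:
        possible = {
            s_next
            for s in possible
            for (s_next, r_next) in delta(p, s, op)
            if r_next == r               # keep only transitions giving response r
        }
        if not possible:                 # no state sequence explains this prefix
            return False
    return True


def counter_delta(p, s, op):
    # Illustrative type: a counter with fetch-and-increment ("fai") and "read".
    if op == "fai":
        return {(s + 1, s)}              # increment, return the previous value
    return {(s, s)}                      # read: state unchanged, return the value


ok = conforms(counter_delta, 0, [(0, "fai", 0), (1, "fai", 1), (0, "read", 2)])
bad = conforms(counter_delta, 0, [(0, "fai", 0), (1, "fai", 0)])
```

Here ok is True, while bad is False because two consecutive fetch-and-increment operations cannot both return 0.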
Algorithms
An algorithm is a concurrent system S = (P, V, H) where every history H ∈ H contains only atomic steps over V. We call such a history a one-level history, to distinguish it from the more complex execution history of an implementation, defined later. The set of histories is defined informally through a pseudo-code procedure for each process as follows: For each operation that a process p applies to a shared variable, H records an atomic step that encodes the variable, the operation applied, and its response. Accesses to private variables correspond to state changes in the automaton for a process and are not explicitly recorded in the history. Steps of different processes can be interleaved in H arbitrarily. An infinite history H of an algorithm is fair if every process that is active in H takes infinitely many steps. (We do not consider terminating algorithms in this paper.)
Implementations
An implementation describes how to simulate a target object of a particular target type using a set of base objects of specified types. Specifically, for each operation of the target type and each process, we define an access procedure that computes the response of the operation under consideration by performing operations on the base objects. An implementation is a concurrent system denoted I = (P, V, H) where the set of shared objects V consists of a distinguished target object, denoted T, and a set of base objects. Histories in H contain a combination of atomic and non-atomic steps. Every history H ∈ H is well-formed, meaning that the following conditions hold:
• T is accessed only using non-atomic steps, and for every base object v ∈ V, v is accessed only using atomic steps.
• For every base object v ∈ V, v conforms to its type in H|v.
• If e_I = (INV, p, T, −) is pending in H then e_I is the last non-atomic step performed by p in H.
• If e_R = (RES, p, T, −) is in H then e_R matches the last invocation step of p that precedes e_R in H.
• If e_A = (ATOM, p, −, −, −) is in H then it occurs after some invocation step and before the matching response (if one exists).
We call an execution history of an implementation a two-level history since operation executions on the target object and on base objects are nested.
Histories in H correspond to executions of the access procedures as follows. When a process p begins executing the access procedure for operation op on T , the history records the step (INV, p, T, op). As p subsequently executes the access procedure, the history records corresponding atomic steps by p on base objects. Finally, when the access procedure returns a value ret, then the history records the response step (RES, p, T, ret). Processes may call the access procedures arbitrarily many times and in arbitrary order. An infinite history H of an implementation is fair if every process that is active in H either takes infinitely many steps, or applies a response step as its last step in H. Informally, this means that in a fair history a process may not stop executing in the middle of an access procedure.
Linearizability
Linearizability [14] is widely accepted as a correctness condition for concurrent objects. Informally, it states that operation executions in a history of an implementation must appear to take effect instantaneously at some point between the corresponding invocation and response steps. Formally, linearizability is defined as follows. Given a history H of an implementation, <_H is the partial order over the set of operation executions in H defined as follows: oe_1 <_H oe_2 iff the response of oe_1 occurs in H before the invocation of oe_2. Two execution histories G and H are equivalent if every process executes the same sequence of steps in both histories. Letting T denote the target object, a completion of H|T is a well-formed history H′ obtained from H by either completing (with a response event) or removing every pending operation execution. H|T is linearizable with respect to type τ if it has a completion equivalent to some complete sequential history H̄ over T such that <_H ⊆ <_H̄ and where T conforms to type τ in H̄. In this case we say that H̄ is a linearization of H. We denote the set of possible linearizations of H by Lin(H). We say that an implementation I = (P, V, H) is linearizable with respect to type τ if for every history H ∈ H, H|T is linearizable with respect to type τ.
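For small histories, the definition above can be checked by brute force. The sketch below is illustrative and restricted to complete histories over a single target object, so no completion step is needed. All function names and the register type used in the example are assumptions made here. The search tries every ordering of the operation executions, keeps those that respect the real-time order, and tests conformity of the resulting sequential history.

```python
from itertools import permutations


def operation_executions(history):
    """Pair each response step with the open invocation of the same process."""
    open_inv = {}   # process -> (index of invocation, operation)
    execs = []      # (process, operation, response, inv_index, res_index)
    for i, step in enumerate(history):
        kind, p = step[0], step[1]
        if kind == "INV":
            open_inv[p] = (i, step[3])
        elif kind == "RES":
            inv_index, op = open_inv.pop(p)
            execs.append((p, op, step[3], inv_index, i))
    if open_inv:
        raise ValueError("this sketch handles complete histories only")
    return execs


def conforms(delta, s_init, ops):
    """True if some state sequence of the type explains the sequential ops."""
    possible = {s_init}
    for p, op, r in ops:
        possible = {s2 for s in possible
                    for (s2, r2) in delta(p, s, op) if r2 == r}
        if not possible:
            return False
    return True


def is_linearizable(history, delta, s_init):
    execs = operation_executions(history)
    n = len(execs)
    for order in permutations(range(n)):
        position = {idx: k for k, idx in enumerate(order)}
        # Real-time order: if a responds before b is invoked, a must come first.
        real_time_ok = all(position[a] < position[b]
                           for a in range(n) for b in range(n)
                           if execs[a][4] < execs[b][3])
        if real_time_ok and conforms(delta, s_init,
                                     [execs[i][:3] for i in order]):
            return True
    return False


def register_delta(p, s, op):
    # Illustrative type: a read/write register; op is ("write", x) or ("read",).
    if op[0] == "write":
        return {(op[1], "OK")}
    return {(s, s)}


overlapping = [("INV", 0, "T", ("write", 1)), ("INV", 1, "T", ("read",)),
               ("RES", 1, "T", 0), ("RES", 0, "T", "OK")]
sequential = [("INV", 0, "T", ("write", 1)), ("RES", 0, "T", "OK"),
              ("INV", 1, "T", ("read",)), ("RES", 1, "T", 0)]
```

With initial state 0, the overlapping history is linearizable because the read may take effect before the concurrent write. The sequential history is not, since the read follows the completed write and yet returns the old value. The search is exponential in the number of operation executions, so it is only suitable for checking small examples.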
Additional Notation
Let G, H be execution histories. If s is a step, we denote by s ∈ H that step s occurs in H, by proc(s) the process that executes s, and by var(s) the object on which s operates. We denote by G ⪯ H that G is a prefix of H, and by G ≺ H that G is a proper prefix of H. If v is an object and H is an execution history such that H|v is complete and sequential, then we denote the state of v at the end of H|v by v H .
Given execution histories (or, more generally, sequences) H and G, let G • H denote the concatenation of G and H (i.e., elements of H appended to G). If G is finite, |G| denotes the length of G. For 0
5 Generic Queue-Based Algorithm
The MutexQueue Type
An N-process MutexQueue is a queue-like object type that stores a subset of N process IDs (subsequently also referred to as processes). The state of MutexQueue is an ordered pair (Q, V), where Q and V are a sequence and a set, respectively, of elements from P. Informally, Q represents the sequence of processes waiting to enter the critical section, and V is a subset of these processes that are visible. Intuitively, a process becomes visible when it "makes itself known" to its predecessor in the queue. The initial state is (⟨⟩, ∅), i.e., the empty sequence paired with the empty set. In addition, we define a special broken state ⊥, indicating that a process has violated the etiquette for accessing MutexQueue (explained below).
A MutexQueue supports three types of operations: enqueue(), isHead(), and dequeue(). Informally, enqueue() adds the executing process to the end of the queue and always returns the response OK; isHead() makes the executing process visible and returns true if and only if this process is the head of the queue; and dequeue() removes the executing process from the head of the queue, and returns the ID of the successor process in the queue, if it exists and is visible (or −1 otherwise). As mentioned earlier, processes are expected to follow a certain etiquette in accessing MutexQueue. Specifically, a process must not invoke enqueue() if it is already in the queue, isHead() if it is not in the queue or is already visible, and dequeue() if it is not the head of the queue or is not visible. Failure to comply with this etiquette causes the MutexQueue to enter the broken state ⊥, and thereafter all responses are completely arbitrary. These restrictions on accessing MutexQueue make it easier to implement this object. As we will see, they are observed by our generic algorithm that uses MutexQueue to solve mutual exclusion (see Section 5.2).
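The informal description above can be turned into a small sequential reference model. This is a sketch of the behaviour as described in prose, not the formal transition mapping. The class name MutexQueueModel and the choice to return None once the object is broken are assumptions made here, since responses in the broken state are arbitrary.

```python
class MutexQueueModel:
    """Sequential model of MutexQueue: state (Q, V) plus a broken flag."""

    def __init__(self):
        self.queue = []        # Q: processes in arrival order
        self.visible = set()   # V: processes that have called is_head()
        self.broken = False    # True once the access etiquette is violated

    def _violation(self):
        self.broken = True
        return None            # arbitrary response (an assumption of this sketch)

    def enqueue(self, p):
        if self.broken or p in self.queue:
            return self._violation()
        self.queue.append(p)
        return "OK"

    def is_head(self, p):
        if self.broken or p not in self.queue or p in self.visible:
            return self._violation()
        self.visible.add(p)                 # the caller becomes visible
        return self.queue[0] == p

    def dequeue(self, p):
        if (self.broken or not self.queue or self.queue[0] != p
                or p not in self.visible):
            return self._violation()
        self.queue.pop(0)
        self.visible.discard(p)
        if self.queue and self.queue[0] in self.visible:
            return self.queue[0]            # successor exists and is visible
        return -1                           # no successor, or not yet visible


m = MutexQueueModel()
m.enqueue(0)
m.is_head(0)              # True: process 0 is the head
m.enqueue(1)              # process 1 is queued but not yet visible
hidden = m.dequeue(0)     # -1: the successor exists but is not visible
late = m.is_head(1)       # True: process 1 finds itself at the head
```

The trace shows why splitting enqueue() and isHead() is safe for the queue discipline. A successor that was not visible at dequeue time is not signalled by its predecessor, but it then discovers on its own that it is the head.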
Prima facie, it would seem that we can have a simpler definition of MutexQueue, and a correspondingly simpler version of the generic mutual exclusion algorithm based on MutexQueue, by combining the enqueue() and isHead() operations into a single operation that adds the ID of the executing process to the end of the queue, and returns true if that process is the head of the queue and false otherwise. Unfortunately, the resulting operation seems too strong; we were not able to find an implementation for it that uses standard synchronization primitives and incurs only a constant number of RMRs. By splitting the functionality into two separate operations, such implementations become feasible.
Formally, an N -process MutexQueue is specified by the tuple (P, O, R, S, τ ) where 
Note that for every p ∈ P, the values pred(s, p) and succ(s, p) are uniquely defined by Observation 5.1 (b). If p, q ∈ P and s is a MutexQueue state then we use the phrases "s is empty," "p is in the queue," "p is the head of s," "p is visible in s," "p is the successor of q in s" and "p is the predecessor of q in s," to denote the conditions empty(s), p ∈ QProcs(s), p = head(s), p ∈ VisProcs(s), p = succ(s, q), and p = pred(s, q), respectively.
Generic Mutual Exclusion Algorithm
In this section we analyze the mutual exclusion algorithm shown in Figure 4. In addition to an atomic MutexQueue object, the algorithm uses an array Wait[0..N − 1] of Boolean read/write registers.
Informally, the algorithm uses the MutexQueue object M to maintain a queue of processes that are competing to enter the critical section. The enqueue() operation at line 2 constitutes the doorway, and the remaining statements leading up to the CS comprise the waiting room. In the exit protocol, spanning lines 7 to 9, a process signals its successor in M (if present and visible) to exit the waiting room and proceed to the CS.
We use syntax of the form V.op(args) in Figure 4 to indicate that process p invokes operation op(args) on the shared variable V . Operations on shared registers are denoted read and write.
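The structure described above can be sketched in Python as follows. This is a hedged reconstruction from the prose and the line numbers cited in the text (enqueue at line 2, isHead at line 3, the wait loop at line 4, the reset of Wait[p] at line 5, dequeue at line 7, the wake-up write at line 9); it is not a transcription of Figure 4. The atomic MutexQueue is emulated with a lock, and the names `AtomicMutexQueue`, `run_gqme` and the bookkeeping dictionary `state` are assumptions introduced for the illustration.

```python
import threading
import time


class AtomicMutexQueue:
    """Stand-in for an atomic MutexQueue: a lock makes each operation indivisible."""

    def __init__(self):
        self._lock = threading.Lock()
        self._q, self._vis = [], set()

    def enqueue(self, p):
        with self._lock:
            self._q.append(p)

    def is_head(self, p):
        with self._lock:
            self._vis.add(p)
            return self._q[0] == p

    def dequeue(self, p):
        with self._lock:
            self._q.pop(0)
            self._vis.discard(p)
            if self._q and self._q[0] in self._vis:
                return self._q[0]
            return -1


def run_gqme(n_procs=4, passages=50):
    M = AtomicMutexQueue()
    wait = [True] * n_procs  # Wait[0..N-1], initially true
    state = {"in_cs": 0, "counter": 0, "violations": 0}

    def process(p):
        for _ in range(passages):
            M.enqueue(p)               # doorway (line 2)
            if not M.is_head(p):       # line 3
                while wait[p]:         # line 4: spin until signalled
                    time.sleep(0)      # yield so other threads can run
                wait[p] = True         # line 5: reset for the next passage
            # critical section: unsynchronized update guarded only by GQME
            state["in_cs"] += 1
            if state["in_cs"] != 1:
                state["violations"] += 1
            state["counter"] += 1
            state["in_cs"] -= 1
            succ = M.dequeue(p)        # line 7
            if succ != -1:
                wait[succ] = False     # line 9: wake the visible successor

    threads = [threading.Thread(target=process, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

Calling `run_gqme(3, 20)` runs three processes through twenty passages each; the returned `counter` should equal the total number of passages and `violations` should stay at zero if mutual exclusion holds.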
Correctness Properties
Mutual Exclusion (ME): at most one process is in the CS at any time.
First-Come First-Served (FCFS): processes enter the CS in the order in which they are enqueued at line 2.
Lockout Freedom (LF): if a process leaves the NCS then it eventually enters the CS.
Bounded Exit (BE): if a process leaves the CS then it enters the NCS within a bounded number of its own steps.
Proof of Correctness
Let S = (P, V, H) be the concurrent system corresponding to Algorithm GQME where
For any history H ∈ H, any process p ∈ P, and any integer i ∈ Z + , we say that p is in the CS in passage i at the end of H if and only if p performs its last step in H during its i'th passage through Algorithm GQME, and furthermore this step is: either (ATOM, p, M, isHead(), true) (see line 3); or (ATOM, p, Wait [p], write(true), OK) (see line 5). Similarly, we say that p has completed the CS in passage i at the end of H if and only if H contains a step (ATOM, p, M, dequeue(), −) (see line 7) performed by p during passage i through Algorithm GQME.
For ease of exposition, we distinguish a number of phases in which a process may be at the end of a history H ∈ H. The phases are defined in Table 2 and the transitions between them are illustrated in Figure 5 . Each phase is bounded by steps on shared objects.
Note that the first five phases defined in Table 2 are mutually exclusive, whereas EXIT is a sub-phase of NEAR NCS, and is not necessarily traversed by a process in every passage through Algorithm GQME. We will subsequently use the name of a phase as a predicate indicating that a process is in the given phase, e.g., WAIT(p) H = true iff process p is in the WAIT phase at the end of a history H ∈ H.
Table 2 (phase definitions). Columns: Phase name; From operation; To operation; Notes/conditions. First row: DOORWAY; enqueue() at line 2; isHead() at ...

To prove the correctness of the algorithm we will establish the following invariant.

Proof. Let H, p and q be as in the hypothesis of the lemma and suppose for contradiction that p has not executed M.dequeue() in passage i in H. From Theorem 5.3 and Invariant 5.2-(H, q), it follows that M H ≠ ⊥ and q = head(M H ). Since p was enqueued in passage i before q in passage j, this implies that p in passage i has been dequeued in H. Since M H ≠ ⊥, it follows that p has executed M.dequeue() in passage i in H, which contradicts the original hypothesis.
Corollary 5.5. Algorithm GQME satisfies Mutual Exclusion.
Proof. Suppose for contradiction that there exists an execution history H ∈ H at the end of which distinct processes p and q are both in the CS, in passages i and j, respectively. Let H̄ ∈ Lin(H). Without loss of generality, suppose that in H̄, p executes M.enqueue() in passage i before q executes M.enqueue() in passage j. Note that at the end of H̄, p and q are both in the CS, in passages i and j, respectively, and in particular p has not executed M.dequeue() in passage i (since p has not invoked M.dequeue() in passage i in H). Thus, H̄, p, and q contradict Lemma 5.4.
Corollary 5.6. Algorithm GQME satisfies First-Come First-Served.
Proof. Suppose for contradiction that there exists an execution history H ∈ H in which process p completes its execution of M.enqueue() in passage i before q begins its execution of M.enqueue() in passage j, and at the end of which q is in the CS in passage j but p has not completed the CS in passage i. In particular, p has not invoked M.dequeue() in passage i in H. Let H̄ ∈ Lin(H). Then p executes M.enqueue() in passage i before q executes M.enqueue() in passage j in H̄. Furthermore, at the end of H̄, q is in the CS in passage j but p has not executed M.dequeue() in passage i (since p has not invoked M.dequeue() in passage i in H). Thus, H̄, p, and q contradict Lemma 5.4.
Theorem 5.7. Algorithm GQME satisfies Lockout Freedom.
Proof. Suppose for contradiction that there is an infinite fair history H ∈ H in which some process p begins some passage i and then takes infinitely many steps but never completes passage i. By the structure of the algorithm, p in passage i loops forever at line 4, repeatedly reading Wait[p] = true. Let E be the prefix of H up to but not including the last step (ATOM, p, M, enqueue(), OK) (see line 2). Choose p so that |E| is minimal. Let F be the prefix of H up to and including the last step (ATOM, p, M, isHead(), ret) for some response ret. Since p loops forever at line 4 it follows that F exists, E is a prefix of F, and ret = false. Furthermore, M F ≠ ⊥ by Theorem 5.3, and p ≠ head(M F ) since ret = false, so pred(M F , p) ≠ ⊥. Let q = pred(M F , p) and note that since |E| is minimal and since H is fair, q eventually enters phase NEAR NCS in H after F. In particular, q eventually executes (ATOM, q, M, dequeue(), p) and (ATOM, q, Wait[p], write(false), OK), in that order, corresponding to line 7 and line 9. Now let G be any prefix of H that extends F and in which q has executed the above two steps. It follows that Wait[p] G = false, which contradicts p repeatedly reading Wait[p] = true at line 4 in H after the prefix F. (We do not consider the possibility of p looping forever during a dequeue() operation on M because we assume in this section that M is an atomic base object. Later on we will show for each implementation of MutexQueue that each operation on the implemented object incurs O(1) steps.)
Theorem 5.8. Algorithm GQME satisfies the bounded exit property.
Proof. The result follows directly from the structure of Algorithm GQME. (We do not consider the number of steps incurred during a dequeue() operation on M because we assume in this section that M is an atomic base object. Later on we will show for each implementation of MutexQueue that a call to dequeue() incurs O(1) steps.)
6 Wait-free Implementation of MutexQueue Using Fetch-and-Increment
The implementation of an N-process MutexQueue object described in Figure 6 is based on the mutual exclusion algorithm of T. Anderson [4], as modified by J. Anderson and Y.-J. Kim for efficient operation in the DSM model (see footnote 7 in [2]). It relies on a shared object supporting a fetch-and-increment (F&I()) operation, which atomically increments a variable and returns its previous value. We assume that this shared object can also be reset to an initial value, e.g., via a write. Implementation MQFI (Figure 6) explicitly maintains a queue of processes using a pair of circular arrays. When a process enqueues itself, it obtains an index in the two arrays by atomically incrementing variable Ctr at line 1. Thus, the set of processes enqueued at a given time maps to a contiguous (modulo N) block of array indices. The array Proc stores the IDs of enqueued processes (that are visible), and array Stat tracks the index of the head element and the visibility of each process. Roughly speaking, this is done as follows: when a process p enqueues itself after a predecessor q, it is assigned array index i, where Stat
Proof of Correctness
We denote Implementation MQFI (shown in Figure 6) of type MutexQueue formally as I MQFI = (P, V, H) where P = {0..N − 1} and V consists of: the base objects {Ctr, Stat[0..N − 1], Proc[0..N − 1]}, denoted subsequently as the set B, and a target object M. Each H ∈ H is a two-level execution history where processes call the procedures enqueue(), isHead() and dequeue() as explained in Section 4. For each such procedure call, H records an invocation step on M for the corresponding operation and, if the procedure call terminates, a matching response step on M with a response equal to the value returned by the procedure call. Similarly, H contains an atomic step for each operation that a process applies to one of the base objects B.
Implementation I MQFI simulates a MutexQueue object that can be used in Algorithm GQME (Figure 4) provided that it is linearizable with respect to the MutexQueue type, and that each call to an access procedure incurs O(1) steps. The latter property follows easily from the structure of the access procedures and also implies O(1) RMR complexity. Therefore, we focus on linearizability. Specifically, we must show that for every H ∈ H, H|M is linearizable with respect to the MutexQueue type. To that end, we will explicitly construct a candidate linearization H̄ of H|M, and prove that M conforms to the MutexQueue type. We will do this using an invariant that relates the state of the base objects to the "linearized state" of M, which is determined by our candidate linearization.
Given H ∈ H, we construct H̄ as follows. Recall that in Figure 6, the access procedure for each operation of type MutexQueue contains one or more accesses to base objects. For each MutexQueue operation execution in H, we define one of these base object accesses as the linearization point of that operation execution. Intuitively (and as we will prove in Theorem 6.3), the order of the linearization points determines the order in which the MutexQueue operations that contain them are linearized. Specifically, the linearization point of
• an enqueue() operation execution is the base object step Ctr.F&I() at line 1;
• an isHead() operation execution is the base object step Stat[index].F&I() at line 4;
• a dequeue() operation execution in which Stat[(index + 1) mod N ].F&I() at line 6 returns 1 is the base object step Proc[(index + 1) mod N ].read() at line 7; and
• a dequeue() operation execution in which Stat[(index + 1) mod N ].F&I() at line 6 returns a value other than 1 is that base object step itself.
Note that the response of a MutexQueue operation execution is uniquely determined if its linearization point has occurred. For enqueue(), the response is always OK. For isHead(), the response is true if and only if the linearization point's response is 1. For dequeue(), the response is the response of the linearization point, if the F&I() at line 6 returns 1; and −1, otherwise.
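To make the roles of Ctr, Stat and Proc concrete, the following Python sketch reconstructs the three access procedures from the prose description and the linearization points listed above. It is an assumption-laden illustration rather than a copy of Figure 6: the fetch-and-increment register is emulated with a lock, the initial values (Ctr = 0, Stat[0] = 1, all other Stat entries 0) are inferred from the informal description of the invariant, and the names `FetchAndIncrement` and `MQFI` are chosen here.

```python
import threading


class FetchAndIncrement:
    """Register with an atomic fetch-and-increment, emulated with a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fai(self):
        with self._lock:
            old = self._value
            self._value += 1
            return old

    def write(self, value):
        with self._lock:
            self._value = value


class MQFI:
    def __init__(self, n):
        self.n = n
        self.ctr = FetchAndIncrement(0)
        # Assumed initialization: slot 0 is handed out first, and its owner is the head.
        self.stat = [FetchAndIncrement(1 if i == 0 else 0) for i in range(n)]
        self.proc = [None] * n
        self.index = [None] * n  # private variable index_p, one per process

    def enqueue(self, p):
        self.index[p] = self.ctr.fai() % self.n  # line 1: obtain an array index
        return "OK"

    def is_head(self, p):
        i = self.index[p]
        self.proc[i] = p                    # publish the ID for the predecessor
        return self.stat[i].fai() == 1      # line 4: response 1 means p is the head

    def dequeue(self, p):
        i = self.index[p]
        nxt = (i + 1) % self.n
        self.stat[i].write(0)               # line 5: release p's own slot
        if self.stat[nxt].fai() == 1:       # line 6: 1 means the successor is visible
            return self.proc[nxt]           # line 7: read the successor's ID
        return -1
```

In this sketch a Stat entry is incremented once by its owner in `is_head` and once by the predecessor in `dequeue`; whichever increment comes second observes the value 1, which is how the owner learns it is the head or the predecessor learns that its successor is visible.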
For any H ∈ H, let H̄ denote the complete sequential history over M defined below, based on the linearization points present in H:
• H̄ contains each operation execution invoked in H|M whose linearization point appears in H, with the response determined by this linearization point, and no other steps.
• Operation executions in H̄ occur in the same order as the corresponding linearization points in H.
Note that, by definition, H̄ is a history over the target object M, so we can use the notation succ(MH̄, p) and pred(MH̄, p) defined in Section 5.1. We also make extensive use of the following notation: index H p is the last value read from Ctr by p in H, reduced mod N, or ⊥ if H|p|Ctr is the empty sequence. Informally, index H p denotes the value of the private variable index p at the end of H, assuming that index is updated atomically with the response of Ctr.F&I() at line 1 of enqueue().
Informally, the following lemma says that two processes currently in the queue cannot be assigned the same array index.
Lemma 6.2. For any H ∈ H and for any p, q ∈ P suppose that H̄ ∈ Lin(H|M ), MH̄ ≠ ⊥, p ∈ QProcs(MH̄), q ∈ QProcs(MH̄), and index H p = index H q . Then p = q.
Proof. Suppose for contradiction that p ≠ q. Without loss of generality, assume that p's last enqueue() in H̄ precedes q's last enqueue(). Then q is the k'th process enqueued after p in H̄ for some k = mN and some m ≥ 1. Let (Q, V ) = MH̄. It follows that |Q| ≥ k + 1 (i.e., Q contains at least p and a chain of k successors up to and including q). Since m ≥ 1 it follows that k + 1 > N, so by the pigeonhole principle Q contains two instances of some element, which contradicts Observation 5.1 (b).
Next, define a bad MutexQueue operation execution as one that violates the access etiquette for MutexQueue. More precisely, if H ∈ H then a MutexQueue operation execution oe by process p in H is bad if and only if there exists a prefix G of H that contains the invocation of oe but not its linearization point, such that Ḡ ∈ Lin(G|M ), MḠ ≠ ⊥, and one of the following holds:
• oe is enqueue() and p ∈ QProcs(MḠ)
• oe is isHead() and either p ∉ QProcs(MḠ) or p ∈ VisProcs(MḠ)
• oe is dequeue() and either p ≠ head(MḠ) or p ∉ VisProcs(MḠ)
The following theorem establishes the correctness of Implementation MQFI.
Theorem 6.3. For any H ∈ H, H|M is linearizable with respect to type MutexQueue.
Proof. We will prove by induction on |H| the following claim:
If H does not contain any bad operation executions then H̄ ∈ Lin(H|M ), MH̄ ≠ ⊥, and the values of the elements of Stat at the end of H are as follows:
between lines 6 and 7 of dequeue() in H)
Informally, the above statement means the following. When p enqueues itself, Stat[index H p ] is 1 if p is the head of M and 0 otherwise. Stat[index H p ] is subsequently incremented once when p becomes visible, and once when p becomes the head (or is about to become the head and its predecessor has partially completed dequeue()). The latter two operations may happen in either order. Finally, Stat[index H p ] returns to 0 once p is visible, is the head of M and has begun dequeuing itself (i.e., executed line 5). Furthermore, when M is empty, Stat[i] = 1 if i is the array index that will be assigned to the next process that enqueues itself, and Stat[i] = 0 otherwise.
Note that, by Lemma 6.2, for every i ∈ [0..N − 1], there is at most one p ∈ QProcs(MH̄) such that i = index H p . In the remainder of the proof we denote the predicate that Stat[i] H has the value specified above by β(H, i).
Basis: |H| = 0. It follows that H and H̄ are both empty, so certainly H̄ ∈ Lin(H|M ). Moreover, empty(MH̄) holds, so β(H, i) follows from the initialization of Implementation MQFI, for all i ∈ [0..N − 1].
Induction Hypothesis: For any l > 0, assume that Theorem 6.3 holds for every H such that |H| < l.
Induction Step: We must prove Theorem 6.3 for every H such that |H| = l. Let G be a prefix of H of length l − 1. We proceed by cases on the last step σ in H. Cases A-G are when H ends with an atomic base object step and Case H is when H ends with a non-atomic step on the target object M. In all these cases we assume that H does not contain a bad MutexQueue operation execution. Finally, Case I is when H does contain a bad MutexQueue operation execution.
Case A: step σ is a Ctr.F&I() (see line 1 of enqueue()) In this case,
and p ∉ QProcs(MḠ), since H does not contain a bad operation execution. Then certainly H̄ ∈ Lin(H|M ), and MH̄ ≠ ⊥. Furthermore, p ∈ QProcs(MH̄), and p ∉ VisProcs(MH̄). Next, note that Ctr H = Ctr G + 1 and
Case G: step σ is a Proc[(index G p + 1) mod N ].read() that returns ret for some ret (see line 7 of dequeue()). In this case,
Let j = index H p , k = (j + 1) mod N, and q = succ(MḠ, p). Note that p ∈ QProcs(MḠ), p ∈ VisProcs(MḠ) and p = head(MḠ) as in Case E, so H̄ ∈ Lin(H|M ) provided that ret = q. Also note that q ≠ ⊥, since if F is the prefix of G that ends just before p's last F&I() operation (i.e., line 6 of dequeue()) then succ(MF̄, p) ≠ ⊥ follows from the arguments in Case F, and succ(MF̄, p) = succ(MḠ, p). Similarly, it follows that q ∈ VisProcs(MḠ) and that q has not begun executing dequeue() by the end of G. From Lemma 6.2 and Implementation MQFI, it follows that no process has overwritten Proc[k] since q last wrote it, so Proc[k] = q, and ret = q, which implies that H̄ ∈ Lin(H|M ), as wanted. Now, β(H, i) for i ∈ [0..N − 1], i ≠ k follows directly from β(G, i). Finally, β(G, k) implies that Stat[k] G = 2 since q ≠ head(MḠ), q ∈ VisProcs(MḠ), and pred(MḠ, q) = p is between lines 6 and 7 at the end of
Case H: step σ is a non-atomic step on the target object M by process p.
Subcase H-i: σ is an invocation step. Then H̄ = Ḡ by definition since the linearization point of every MutexQueue operation occurs after the initial invocation step. Furthermore, Ḡ ∈ Lin(H|M ) since Ḡ ∈ Lin(G|M ) and H = G • s I where s I is an invocation. Thus, H̄ ∈ Lin(H|M ), and MH̄ ≠ ⊥ since MḠ ≠ ⊥ by the IH.
Subcase H-ii: σ is a response step. Then the linearization point of the operation execution corresponding to σ has occurred in G, and so Ḡ contains this operation execution. Since Ḡ ∈ Lin(G|M ) by the IH, it follows that Ḡ ∈ Lin(H|M ) provided that σ and the last step in H̄|p have equal return values. But the latter follows from our construction of H̄. (Recall that for an operation execution that is pending in H, if the linearization point has occurred then the operation execution is completed with a matching response step in H̄ that returns the uniquely-determined return value of the access procedure.) Similarly, it follows that H̄ = Ḡ. Thus, H̄ ∈ Lin(H|M ) and MH̄ ≠ ⊥ since Ḡ ∈ Lin(G|M ) and MḠ ≠ ⊥ by the IH.
Case I: H contains a bad MutexQueue operation. Let F be the prefix of H up to but not including the first invocation step σ I of a bad MutexQueue operation execution. By the IH, F̄ ∈ Lin(F |M ) and MF̄ ≠ ⊥. To obtain a linearization of H|M, first let L = F̄ • σ I , σ R where σ R is a response matching σ I , with an arbitrary return value. Since σ I corresponds to a bad operation execution, it follows that L ∈ Lin((F • σ I )|M ), and that M L = ⊥. Finally, form L ′ by appending to L a complete operation execution on M for all remaining operation executions in H|M (i.e., those that have been invoked but are not present in L), say in the order of their invocation steps in H. Once again assign the return value for each such operation execution arbitrarily. Since M L = ⊥, it follows that L ′ ∈ Lin(H|M ).
RMR Complexity
Each access procedure of Implementation MQFI performs O(1) steps since there are no loops. In particular, the RMR complexity of each access procedure is O(1).
Bounded Memory Implementation
A drawback of the above implementation is that Ctr grows without bound. We now discuss how to implement Ctr using bounded memory. One approach, used in [3], is to atomically subtract N from Ctr whenever N − 1 is fetched from the F&I() at line 1 of enqueue(). This ensures that Ctr never grows beyond 2N − 1 (since at most N − 1 other processes can increment Ctr before N is subtracted). The drawback of this solution is that a fetch-and-add primitive is needed in addition to (or in place of) fetch-and-increment. Another solution, brought to our attention by Prasad Jayanti, is to allow Ctr to overflow, provided that it returns to zero without halting the execution. In particular, if Ctr is an unsigned m-bit integer and N divides 2^m, then it is easy to see that Implementation MQFI remains correct (i.e., the values assigned to index are as before).
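The overflow argument can be checked with a few lines of arithmetic. The sketch below, with hypothetical helper names `indices_with_overflow` and `indices_unbounded`, compares the array indices produced by an m-bit wrapping counter against those produced by an unbounded counter; it models a sequential stream of increments only and says nothing about concurrency.

```python
def indices_with_overflow(n, m, count):
    """Indices (counter mod n) handed out by an unsigned m-bit counter that wraps to 0."""
    mask = (1 << m) - 1
    ctr, out = 0, []
    for _ in range(count):
        out.append(ctr % n)
        ctr = (ctr + 1) & mask  # overflow wraps around to zero
    return out


def indices_unbounded(n, count):
    """Indices handed out by an idealized counter that never overflows."""
    return [c % n for c in range(count)]
```

With N = 4 and m = 3 (so N divides 2^m = 8) the two sequences coincide. With N = 3 and m = 3 they diverge at the first wrap-around: the unbounded counter would assign index 8 mod 3 = 2, whereas the wrapped counter assigns index 0.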
7 Wait-free Implementation of MutexQueue Using Fetch-and-Store
The implementation of an N-process MutexQueue presented in Figure 7 is based on the mutual exclusion algorithm of Craig [8, 7], in particular a variant brought to our attention by Prasad Jayanti. It relies on a shared object supporting a fetch-and-store (F&S) operation, which atomically writes a variable and returns its previous value. Without loss of generality, we assume that such an object also supports an ordinary write operation. (One can always simulate a write by applying a F&S and ignoring the response.) Informally, Implementation MQFS (Figure 7) works as follows. At each point in time each process p "owns" exclusively an index myIdx p of array Queue; the index owned by p changes each time the process dequeues itself (see line 10). For this reason Queue has N + 1 entries; if a process dequeues itself at a time when all others are enqueued, it needs to acquire an index different from those owned by the other processes and from the index it previously owned.
The processes currently in the queue implicitly form a list, the first element of which is the head of the queue. The shared variable Last contains the index owned by the last process in the queue. (Whenever the queue is empty, Last contains an index not currently owned by any process.) When process p enqueues itself it uses F&S on Last to find out its predecessor's index (which p records in prevIdx p ) and to atomically swap its own index into Last (see line 2). The use of F&S to atomically read and update Last ensures the integrity of the list of processes waiting in the queue; it is not possible for two processes getting enqueued concurrently to consider the same process as their predecessor.
Recall from the specification of MutexQueue that the operation isHead() has two objectives: (a) to determine whether the process p executing the operation is the head of the queue, and (b) to make p visible to its predecessor, thereby ensuring that when the predecessor dequeues itself, it will "wake up" p. In addressing the second objective we must contend with the possibility of p becoming visible to its predecessor just as that predecessor is dequeueing itself. This race condition is handled by appropriate use of F&S. We now explain how the implementation of MutexQueue achieves these two objectives.
When process p enqueues itself, it sets Queue[myIdx p ] = (myIdx p , p) (see line 1). When p dequeues itself, it sets Queue[myIdx p ] to a value different from (myIdx p , −), specifically to (prevIdx p , p) (see line 6). (We use F&S for this assignment because of the race condition mentioned above, as we will explain shortly.)
When it executes operation isHead(), process p signals its predecessor that it has become visible by swapping the index it owns, myIdx p , and its ID, into the predecessor's position of array Queue, namely Queue[prevIdx p ]; it records the old value of Queue[prevIdx p ] in tempIdx p and tempId p (see line 4). With this information, p can determine if it is the head of the queue: this is the case if and only if its predecessor had dequeued itself by the time p signalled that it is visible, i.e., if and only if tempIdx p ≠ prevIdx p (see line 5).
Finally, we explain how a process p that is dequeuing itself ensures that it "wakes up" its successor, provided that the latter is visible. is not yet visible and so p is not responsible for waking it up. Accordingly, in this case p's dequeue() operation returns −1 (see line 9).
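The mechanism described in this section can be sketched in Python as follows. This is a hedged reconstruction from the prose, not a transcription of Figure 7. Several details are assumptions made for the illustration: the fetch-and-store register is emulated with a lock; process p initially owns index p; Last initially holds the unowned index N; Queue[N] initially holds a pair whose first component differs from N; and at line 10 a dequeuing process adopts its predecessor's old index (the text states only that the owned index changes there). The names `FetchAndStore` and `MQFS` are chosen here.

```python
import threading


class FetchAndStore:
    """Register with an atomic fetch-and-store (swap), emulated with a lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def fas(self, new):
        with self._lock:
            old, self._value = self._value, new
            return old

    def write(self, new):
        self.fas(new)  # a write is an F&S whose response is ignored


class MQFS:
    def __init__(self, n):
        # Queue has N + 1 entries; slot n starts unowned (assumed initial values).
        self.queue = [FetchAndStore((i, i)) for i in range(n)]
        self.queue.append(FetchAndStore((-1, -1)))
        self.last = FetchAndStore(n)
        self.my_idx = list(range(n))   # private myIdx_p
        self.prev_idx = [None] * n     # private prevIdx_p

    def enqueue(self, p):
        me = self.my_idx[p]
        self.queue[me].write((me, p))            # line 1
        self.prev_idx[p] = self.last.fas(me)     # line 2: learn predecessor's index
        return "OK"

    def is_head(self, p):
        prev = self.prev_idx[p]
        temp_idx, _ = self.queue[prev].fas((self.my_idx[p], p))  # line 4
        return temp_idx != prev                  # line 5: predecessor already gone?

    def dequeue(self, p):
        me, prev = self.my_idx[p], self.prev_idx[p]
        temp_idx, temp_id = self.queue[me].fas((prev, p))        # line 6
        self.my_idx[p] = prev                    # line 10 (assumed rule for the new index)
        if temp_idx == me:
            return -1                            # line 9: successor not yet visible
        return temp_id                           # successor's ID
```

The single F&S on Queue[myIdx p ] in `dequeue` and the single F&S on the same entry in the successor's `is_head` resolve the race between the two: whichever is applied second observes the value written by the first.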
Proof of Correctness
We proceed using the same approach as in Section 6. We denote Implementation MQFS (shown in Figure 7) of type MutexQueue formally as I MQFS = (P, V, H) where P = {0..N − 1} and V consists of: the base objects {Last, Queue[0..N ]}, denoted subsequently as the set B, and a target object M. Histories in H model the execution of Implementation MQFS in a sense analogous to the one defined in Section 6.1 for Implementation MQFI. As before, it follows easily that each call to an access procedure incurs O(1) steps, and so we focus on linearizability. To that end, we define for any H ∈ H a candidate linearization H̄ using the same approach as in Section 6.1. We also define bad operation executions exactly as in Section 6.1. For the candidate linearization, we define the linearization point of
• an enqueue() operation execution is the base object step Last.F&S(myIdx) at line 2;
• an isHead() operation execution is the base object step Queue[prevIdx].F&S at line 4; and
• a dequeue() operation execution is the base object step Queue[myIdx].F&S at line 6.
Note that as in Section 6.1, the response of a MutexQueue operation execution is determined uniquely if its linearization point has been reached. For enqueue(), the response is always OK. For isHead(), the response is true if and only if the linearization point's response is different from the value of prevIdx for the calling process. For dequeue(), the response is −1 if the F&S at line 6 returns an ordered pair of the form (myIdx, −), and is the second element in this ordered pair otherwise. In the proof of correctness of Implementation MQFS it will be useful to refer to the values of private variables at the end of histories in H. Let H ∈ H be a history such that H̄ ∈ Lin(H|M ) and MH̄ ≠ ⊥. Let v p be a private variable of process p (i.e., one of myIdx p , prevIdx p or tempIdx p ). We use v H p to denote the value of v p at the end of H, assuming that each assignment to a private variable of p occurs at the same time as the response of the last base object step by p that precedes that assignment in the execution corresponding to H.
For any H ∈ H, p ∈ P and i ∈ [0..N ], we say that p owns i at the end of H if and only if myIdx H p = i.
... it follows that myIdx F r = j. Also note that no process other than r applies a MutexQueue operation execution in Ḡ after F̄. There are two cases, each leading to a contradiction.
• If z = r then myIdx ...
... in contrast, MutexQueue contains distinct operations corresponding to these two tasks. Also, in the CC model a process in the exit protocol can wake up its successor without knowing the successor's identity. Thus, queue-based local-spin algorithms specific to the CC model operate in a mode significantly different from the one captured by the MutexQueue data type.
