Abstract: In this paper we settle an open question by determining the remote memory reference (RMR) complexity of randomized mutual exclusion, on the distributed shared memory model (DSM) with atomic registers, in a weak but natural (and stronger than oblivious) adversary model. In particular, we present a mutual exclusion algorithm that has constant expected amortized RMR complexity and is deterministically deadlock free. Prior to this work, no randomized algorithm with o(log n/ log log n) RMR complexity was known for the DSM model. Our algorithm is fairly simple, and compares favorably with one by Bender and Gilbert [11] for the CC model, which has expected amortized RMR complexity O(log² log n) and provides only probabilistic deadlock freedom.
I. INTRODUCTION
Mutual exclusion, introduced by Dijkstra [16], is one of the best studied problems in concurrent computing. A mutual exclusion object (or lock) is a fundamental synchronization primitive that allows processes to coordinate their access to a shared resource, by serializing the execution of a piece of code, called the critical section. At any point in time, at most one process may be in its critical section; we say that this process owns the lock. A process obtains a lock through an entry section (or capture protocol), and the owner of a lock frees up the lock by executing an exit section (or release protocol). A textbook by Raynal [33] is devoted to mutual exclusion research up to the mid 80s, and a survey by Anderson, Kim, and Herman [3] covers research between 1986 and 2003.
Early mutual exclusion algorithms did not take into account the gap between high processor speeds and the low speed/bandwidth of the processor-memory interconnect [12]. In distributed shared memory (DSM) systems, each shared variable is permanently locally accessible to a single processor and remote to all other processors. In cache-coherent (CC) systems, each processor keeps local copies of (remote) shared variables in its cache, and the consistency of copies in different caches is maintained by a coherence protocol. Memory accesses that cannot be resolved locally in DSM and CC systems are called remote memory references (RMRs). RMRs are orders of magnitude slower than local memory accesses. Hence, the performance of many algorithms for shared memory multiprocessor systems depends critically on the number of RMRs they incur [7], [31].
The mutual exclusion problem inherently requires processes to busy-wait in their entry section, and thus the number of shared memory accesses cannot be bounded. Therefore, the traditional step complexity measure, which counts the number of shared memory accesses, is not useful to determine the performance of mutual exclusion algorithms. Local-spin algorithms, which perform busy-waiting by repeatedly reading locally accessible shared variables, can achieve bounded RMR complexity and have practical performance benefits [7] . Recent research has almost entirely used the RMR complexity as a metric for the performance of mutual exclusion algorithms (see, e.g., [5] - [7] , [9] - [11] , [14] , [15] , [19] , [22] , [23] , [25] - [29] , [32] ).
Using strong primitives, such as fetch&increment objects, it is possible to implement mutual exclusion so that every process incurs only a constant number of RMRs per passage through the critical section. A prominent example is the MCS lock [31] , which uses an object that allows both compare&swap (CAS) and swap operations. Other examples can be found in standard textbooks, such as [24] .
A significant amount of research has focused on determining the RMR complexity of the mutual exclusion problem if only atomic registers are available. Some common synchronization primitives, and in particular CAS and load-linked/store-conditional objects, have linearizable implementations with constant RMR complexity from registers [20], and therefore they cannot help improve the asymptotic worst-case RMR complexity.
Unless mentioned otherwise, the results discussed below hold for the CC and the DSM model with atomic registers, and n is the number of processes. The deterministic RMR complexity of mutual exclusion is Θ(log n) RMRs per passage through the critical section. The upper bound was established by Yang and Anderson's algorithm with O(log n) worst-case RMR complexity [34] . Further, Anderson and Kim [4] conjectured that this bound is optimal. Following several lower bound proofs of increasing strength [13] , [17] , [27] , Attiya, Hendler, and Woelfel [10] proved this conjecture true.
More recently, randomized techniques have been employed to improve the efficiency of mutual exclusion algorithms. To capture how random decisions made by processes can influence the order in which processes take steps (e.g., because accesses of some shared registers may be slower than others), it is assumed that an adversary produces the schedule. Among the most common adversary models are the strong adaptive adversary, where scheduling decisions can depend on all past events, including local coin flips, and the oblivious adversary, where scheduling decisions are independent of processes' random decisions, i.e., the adversary fixes the schedule in advance. Unfortunately, little can be gained by using randomization in the strong adaptive adversary model: Giakkoupis and Woelfel [19] showed that any mutual exclusion algorithm in this model has expected RMR complexity Ω(log n/ log log n), matching an upper bound by Hendler and Woelfel [23] .
Note that unlike in deterministic algorithms, in randomized ones linearizable implementations of objects can in general not replace atomic objects, without affecting the probability distribution over possible outcomes [21] . (Linearizability is defined in Section II; an object is atomic 1 if each operation takes effect instantaneously, once invoked. In particular, multiple operations on the same atomic object do not overlap in an execution.) The known constant-RMR CAS implementations [20] , however, preserve those probability distributions against a strong adaptive adversary. Therefore, the tight Θ(log n/ log log n) expected RMR bound for mutual exclusion for the strong adaptive adversary holds even if CAS or load-linked/store-conditional objects are available, in addition to registers. Hence, it is not possible to achieve o(log n/ log log n) RMR complexity without using stronger, less common synchronization primitives.
The strong adaptive adversary constitutes a very pessimistic system assumption, as it assumes that the system reacts in the most undesirable way to random decisions made by processes. Recently, researchers have increasingly focused on finding efficient randomized algorithms for the weaker, oblivious adversary model, e.g., for test-and-set [2] , [18] or consensus [8] .
Bender and Gilbert [11] have devised a randomized mutual exclusion algorithm (henceforth called the BG algorithm) that achieves O(log² log n) expected amortized RMR complexity against the oblivious adversary, in a CC model that provides atomic registers and CAS objects. However, unlike existing randomized algorithms for the strong adaptive adversary, the BG algorithm guarantees deadlock freedom only with high probability per passage through the critical section (rather than with certainty). The BG algorithm uses CAS objects as mentioned above, and it remains unknown whether a similarly efficient implementation from registers exists. For the DSM model, no mutual exclusion algorithm with randomized o(log n/ log log n) RMR complexity against an oblivious adversary was known, until now.

1 Sometimes in the literature the term "atomic" is used to denote "linearizable" (see, e.g., Lynch's textbook [30]).
Our Contribution
We present a mutual exclusion algorithm for DSM systems that is optimal w.r.t. several parameters. In particular, it
• has constant expected amortized RMR complexity in the oblivious adversary model,
• is deterministically deadlock free (and can be transformed into starvation-free using standard techniques),
• can be implemented from atomic O(log n)-bit registers only.

In fact, we use an adversary that is stronger than the oblivious one, and seems realistic for the DSM model. The adversary can make scheduling decisions based on limited information about the operations each process has incurred in the past and the operation it will incur in its next step. While the adversary cannot know the exact register location on which such an operation has or will be performed, it can take into account the type of this operation (read or write), and whether it is a local or remote reference.
In our presentation of the DSM algorithm, we use a single CAS object, whose only purpose is to allow processes to repeatedly elect a leader, i.e., solve name consensus. Our complexity analysis makes no assumption that the CAS object is atomic (it assumes linearizability, but even weaker consistency conditions suffice), and therefore known implementations of CAS objects from registers [20] can be used without sacrificing the asymptotic RMR complexity of our mutual exclusion algorithm.
Finally, our algorithm is fairly simple. This is in contrast to the BG algorithm, which relies on a stack of other implemented objects, such as max-registers and approximate counters with various properties.
II. MODEL
We consider the standard distributed shared memory (DSM) model, where a set {0, . . . , n − 1} of n processes communicate by executing read and write operations on shared atomic registers. The set of registers is partitioned into n memory segments, one for each process. A read or write step on a register R by process p incurs a remote memory reference (RMR) if and only if R is not in p's memory segment. In this paper we assume that some registers are remote to all processes; it is not hard to see that this assumption can be made w.l.o.g. in mutual exclusion algorithms.
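The RMR accounting rule above can be illustrated with a small Python sketch. The class names `DSMRegister` and `SharedMemory` are hypothetical and not part of the paper's pseudocode; an `owner` of `None` models a register that is remote to all processes. The sketch only counts RMRs in a sequential simulation and makes no claim about a real memory system.

```python
class DSMRegister:
    """Register stored in the memory segment of `owner` (hypothetical model).

    owner=None models a register that is remote to every process.
    """

    def __init__(self, owner, value=None):
        self.owner = owner
        self.value = value


class SharedMemory:
    """Executes reads and writes and counts RMRs per process."""

    def __init__(self, n):
        self.rmrs = [0] * n  # rmrs[p] = number of RMRs incurred by process p

    def _charge(self, pid, reg):
        # A step is an RMR iff the register is not in pid's own segment.
        if reg.owner != pid:
            self.rmrs[pid] += 1

    def read(self, pid, reg):
        self._charge(pid, reg)
        return reg.value

    def write(self, pid, reg, value):
        self._charge(pid, reg)
        reg.value = value


mem = SharedMemory(n=2)
flag = DSMRegister(owner=0, value=0)   # lives in process 0's segment
for _ in range(1000):
    mem.read(0, flag)                  # local spinning by process 0: no RMRs
mem.write(1, flag, 1)                  # process 1 writes remotely: one RMR
```

In this sketch, process 0 can busy-wait on its own register at no RMR cost, which is the reason local-spin algorithms can have bounded RMR complexity even though the number of steps is unbounded.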
A schedule is a sequence of process IDs, and yields an execution in which processes take shared memory steps in the order determined by the sequence. Processes can flip (private) coins to make random decisions. We consider an adversary that schedules processes in an adaptive way, but with limited information. When scheduling the next process to take a step, the adversary has available the following information about each past step of any process, and about the step each process is poised to execute: the type of that step, i.e., whether it is a read or write, and whether the step constitutes a remote or local reference, i.e., whether or not the affected register is in the executing process' memory segment. The exact location of the register to which a read or write operation is applied, or what value is being read or written, is not revealed to the adversary, even after the process has executed that operation. In addition, we assume that the adversary learns whenever a lock() or release() call responds. (We assume that each process calls release() immediately after termination of its lock() method call, so the adversary knows when a process is poised to call release(). Thus, it can delay those release() calls arbitrarily.)
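As an illustration of the limited information available to this adversary, the following Python sketch projects a step onto what the adversary is told. The `Step` record and the function `adversary_view` are hypothetical names introduced here; the sketch assumes the adversary knows which process takes the step, the operation type, and whether the reference is local or remote, and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    pid: int             # process taking the step
    op: str              # "read" or "write"
    owner: object        # segment owner of the accessed register (None = remote to all)
    index: int           # exact register location: hidden from the adversary
    value: object = None # value read or written: hidden from the adversary


def adversary_view(step):
    """Return only the information revealed to the adversary about a step."""
    locality = "local" if step.owner == step.pid else "remote"
    return (step.pid, step.op, locality)


# Two remote writes by process 3 to different locations with different values
a = Step(pid=3, op="write", owner=None, index=1, value=(3, 17))
b = Step(pid=3, op="write", owner=None, index=4, value=(3, 18))
```

Under these assumptions `adversary_view(a)` and `adversary_view(b)` coincide, so a schedule cannot depend on which of the two locations the process chose.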
A compare&swap (CAS) object C supports the operations C.read(), which returns the value of C, and C.CAS(old, new), which writes new into C if C = old, and otherwise does not change C. In either case, it returns the value that C had at the point immediately before the operation was applied. It is known that CAS objects can be implemented from atomic registers such that each CAS() operation incurs O(1) RMRs [20] . This implementation is linearizable: In any execution on an implemented CAS object, every CAS() operation can be associated with a linearization point between the invocation and response of that CAS(), such that ordering all CAS() operations by their linearization points yields a sequential execution that matches the specification of CAS.
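The sequential specification of the CAS object described above can be written down as a short Python sketch. The class name `CASObject` and the use of `None` for the initial value are illustrative assumptions. This is only the specification that linearized operations must match, not the register-based implementation of [20], and it is not safe for concurrent use.

```python
class CASObject:
    """Sequential specification of a compare&swap object (illustrative)."""

    def __init__(self, initial=None):
        self._value = initial

    def read(self):
        """Return the current value of the object."""
        return self._value

    def CAS(self, old, new):
        """Write `new` if the current value equals `old`.

        In either case, return the value held immediately before the
        operation was applied.
        """
        previous = self._value
        if previous == old:
            self._value = new
        return previous


c = CASObject(initial=None)
first = c.CAS(None, 3)    # succeeds: returns the old value None
second = c.CAS(None, 5)   # fails: returns 3 and leaves the object unchanged
```

A caller detects success by comparing the returned value with the `old` argument it passed.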
The mutual exclusion problem can be specified in terms of a lock object, which supports operations lock() and release(). Each process must alternate lock() and release() calls, starting with lock(). We say a process is in the entry section if its lock() method is pending; it is in the critical section if it has completed a lock() call but since then not called release(); and it is in the exit section while its release() method is pending. A process that is not in any of the entry, critical, or exit sections is in the remainder section.
A lock object provides the safety property of mutual exclusion, which states that no two processes can be in the critical section at the same time. Several progress conditions have been considered for mutual exclusion algorithms. The weakest standard condition is deadlock freedom, which guarantees system progress: as long as all processes that are not in the remainder section take sufficiently many steps, some process will enter the critical section.
III. THE ALGORITHM
A. Main Ideas and High Level Description
We start by describing the core ideas of the algorithm. The complete pseudocode is given in Fig. 1 , but here we will make some simplifications in order to not distract from the main insights. In particular, our code uses some sequence numbers, which we omit from this description. Also, some of the variables in the pseudocode are indexed by α. We will explain the purpose of this index later, and for now we will omit α from the corresponding variable names.
To decide which process enters the critical section first, we use a simple leader election protocol. The functionality of that is provided by a CAS object, denoted S in our pseudocode, which is initially ⊥. Each process p executes S.CAS(⊥, p), and if it succeeds (i.e., the operation returns ⊥) it becomes the leader, otherwise it loses.
A basic idea (albeit one that does not work without some additional twists) is the following: The losers of the leader election try to "notify" the leader, and if successful, the leader coordinates their passage through the critical section. For this mechanism, we use the notion of a backpack. Intuitively, processes try to join the leader's backpack while it is "open". At some point, the leader "closes" its backpack, and then arranges that all processes in the backpack go through the critical section, one by one.
More precisely, the leader w has n distinct registers in its local memory segment, namely B[w][0], . . . , B[w][n − 1]. [...] then p "gives up", and does not wait for the leader to promote it. Finally, once the leader w has coordinated all processes from its backpack into the critical section, it resets the CAS object S by executing S.CAS(w, ⊥), so that subsequently other processes can become leaders.
While one can easily design a correct algorithm based on the above technique, this technique by itself does not achieve the desired RMR complexity, even against an oblivious adversary: The adversary could first schedule one process until it becomes the leader and has closed its backpack. After that, and before the leader resets the CAS object S, the adversary schedules all remaining processes to participate in the leader election. These processes will fail to join the backpack, and thus their RMRs are "wasted".
To motivate our second core idea, which deals with this issue, assume for a moment that processes have access to an oracle. After process w becomes the leader, the oracle provides it with exactly one ID, of a process q* chosen uniformly at random from the set M of processes that are already in the entry section or will enter the entry section before w closes its backpack. Given this information, the leader can busy-wait on B[w][q*] (which is in w's memory segment) until q* appears there and thus has joined the backpack. Only after that does w close its backpack and promote all the processes it finds in B[w][·] into the critical section. If the random choice of q* ∈ M is independent of the adversary's scheduling decisions, then we expect that roughly half of the processes in M join w's backpack (i.e., write to B[w][·]) before q* does. Hence, half of the processes in the entry section get promoted (in expectation), and thus for every constant number of RMRs, one process enters the critical section.
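The "roughly half" claim can be checked with a small exact computation. The Python sketch below is an illustration under simplifying assumptions: the adversary fixes an arbitrary order in which the processes of M join the backpack, and q* is then uniform over M and independent of that order. The function name `expected_joined_before` is hypothetical.

```python
from fractions import Fraction


def expected_joined_before(join_order):
    """Expected number of processes that join before q*, for uniform q*.

    join_order: list of distinct process IDs in the order (chosen by the
    adversary) in which they write to the leader's backpack.
    """
    m = len(join_order)
    # If q* is the process at position k (0-based), exactly k processes
    # joined before it; average this over all m equally likely choices.
    total = sum(join_order.index(q) for q in join_order)
    return Fraction(total, m)


print(expected_joined_before([4, 1, 7, 2, 9]))
```

For any order of m distinct processes the result is (m − 1)/2, so the adversary's choice of order does not help it as long as it is independent of q*.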
We now describe a randomized mechanism that provides a functionality that can replace the oracle above. We use a shared array R[1..ℓ], where ℓ = log n + 1, and each array entry is initially ⊥ (in our pseudocode we again use sequence numbers, so the actual initial value is different). Before participating in the leader election, each process writes its ID to an array entry R[λ], where λ ∈ {1, . . . , ℓ} is chosen at random in such a way that λ = i with probability Θ(1/2^i). The leader scans this array from left to right until it finds the first index i such that R[i] = ⊥. Then (slightly simplifying matters) it uses the process ID q* found in R[i − 1] in the same way as the oracle response above, i.e., it waits for q* to join its backpack. The crucial insight now is the following: Suppose M is the set of processes that write to R before the point t when q* starts to scan R, and let m = log |M|. Then with constant probability the following "good" event happened: all registers R[1], . . . , R[m] were written by processes in M, and exactly one process in M wrote to R[m].
Given this event, the process that wrote to R[m] is uniformly distributed over M . Hence, by waiting for that process q * to join its backpack, w ensures that it does not close its backpack before Ω(|M |) processes have also joined its backpack, in expectation. To deal with some technical issues arising from processes writing to R at different times, and to simplify the analysis, the leader actually waits for every process it finds on R, not only the "topmost" one. Clearly, waiting longer cannot make matters worse.
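The simplified mechanism can be sketched in Python as follows. Several details are assumptions made only for illustration: ℓ is taken as ⌊log2 n⌋ + 1, λ = i is chosen with probability exactly 2^(−i) for i < ℓ with the remaining mass placed on ℓ (one distribution satisfying the Θ(1/2^i) requirement), sequence numbers are omitted, and the function names are hypothetical.

```python
import math
import random


def sample_lambda(ell, rng):
    """Return i in {1..ell}: Pr[i] = 2^-i for i < ell, rest of the mass on ell."""
    for i in range(1, ell):
        if rng.random() < 0.5:
            return i
    return ell


def topmost_entry(R):
    """Scan R left to right and return the entry just before the first empty cell.

    R is a Python list standing for R[1..ell]; None plays the role of ⊥.
    Returns None if R[1] is empty, and the last entry if no cell is empty.
    """
    previous = None
    for cell in R:
        if cell is None:
            break
        previous = cell
    return previous


n = 16
ell = int(math.log2(n)) + 1          # assumption: floor(log2 n) + 1
rng = random.Random(0)               # fixed seed so the run is repeatable
R = [None] * ell
for pid in range(n):                 # each process writes its ID to R[lambda]
    R[sample_lambda(ell, rng) - 1] = pid
q_star = topmost_entry(R)            # the process the leader would wait for
```

The sketch returns only the "topmost" process; as noted above, the actual algorithm waits for every process it finds on R.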
The mechanism above still does not guarantee that all processes have a chance of getting promoted, but only those that write to R during a specific time interval. In particular, a process that writes to R after w scanned that array may be scheduled in such a way that it has no chance of getting promoted. To describe the solution to this problem we use the notion of "good" intervals. Recall that the CAS object S gets reset to ⊥ whenever a leader finishes its exit section, before it gets captured by the next leader. A good interval starts whenever the CAS object gets reset, and ends just before the next leader starts scanning R. As argued above, our technique guarantees that if M is the set of processes that write to R during a good interval, then in expectation Ω(|M |) processes get promoted by the one that becomes the leader in that interval.
We employ the following simple trick: We use two copies of essentially the entire data structure (i.e., of almost all shared objects). In the pseudocode, we add a subscript α to each shared object, where the value of α ∈ {0, 1} indicates the copy of the object. (We say "side α" to refer to copy α of the data structure.) At the beginning of its entry section, a process chooses a side α ∈ {0, 1} uniformly at random. Then, it proceeds as described above, but uses side α of the data structure. Since there are also two CAS objects, S_0 and S_1, we may now have two competing leaders. To synchronize between them, we use an additional 2-process lock object, L (in fact, for technical reasons, we need L to be a 4-process lock). As soon as a process becomes the leader of side α (i.e., it captures S_α), it tries to capture L; and when it has captured L, it releases it only after it resets S_α. As a consequence, every point in time belongs to either a good interval on side 0, or a good interval on side 1. Hence, since processes choose α at random, whenever they write to R_α they have (at least) a 1/2 probability of writing during a good interval on side α.
There are several small technical difficulties to overcome in order to make these ideas work. Many of the difficulties stem from the fact that information on R may be outdated, i.e., it is left behind by processes that did not manage to join a backpack. To deal with this and other issues we use sequence numbers that processes increment frequently, and attach to the information they write to registers. Techniques to recycle sequence numbers are known [1], [20] and well understood. It is not difficult to bound sequence numbers so that our algorithm needs only O(log n)-bit numbers, but doing so makes the implementation more complicated and distracts from the core ideas.
B. Implementation
The pseudocode of our algorithm is given in Fig. 1.
• R_s[1..ℓ], for s ∈ {0, 1} and ℓ = log n + 1, is an array of registers, each storing a pair in P × N_0 that is initially (0, 0).
• S_s, for s ∈ {0, 1}, is a CAS object which stores values in (P × N_0) ∪ {(⊥, ⊥)}, and is initially (⊥, ⊥).
• L is a 4-process lock (4PLock) which can be accessed by the processes 0, . . . , 3.
• Bit_s, for s ∈ {0, 1}, is a Boolean register that is initially 0.

If p's CAS() succeeds, p becomes the leader on side α. In that case, it reads Bit_α into bit and then tries to capture lock L (in lines 8 and 9), using a virtual 2-bit ID with high-order bit α and low-order bit bit. Then, in lines 10-14, the process scans array R_α, from left to right. Each time it finds a process/sequence-number pair (r, d), it checks the status of process r in A[r]. If that status is want or (p, c), then it is guaranteed that r will get promoted. All such pairs are added to a local set found, and as soon as the leader sees a process on R_α that does not meet the above criterion, it stops scanning R_α. In lines 15 [...] First it flips the bit Bit_α, then it resets the CAS object S_α to the initial value (⊥, ⊥), and finally it releases lock L. Note that it releases L only after resetting S_α. This is necessary to ensure that there is always a good interval, either on side 0 or side 1 (see the high level description in Section III-A). However, this also means that a new leader may get elected on side α before the previous leader on side α has released L. Since processes use the bit Bit_α to compute their virtual IDs, it is ensured that access to lock L is still safe. (This is the reason why we need to use a 4-process lock L, instead of a 2-process one.)
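The virtual 2-bit ID used to access the 4-process lock can be illustrated with a few lines of Python. The function name `virtual_id` is hypothetical; the encoding (high-order bit α, low-order bit taken from Bit_α) follows the description above.

```python
def virtual_id(alpha, bit):
    """Combine side alpha (high-order bit) and Bit_alpha (low-order bit)."""
    assert alpha in (0, 1) and bit in (0, 1)
    return (alpha << 1) | bit


# All four (side, bit) combinations map to distinct IDs in {0, 1, 2, 3}.
all_ids = sorted(virtual_id(a, b) for a in (0, 1) for b in (0, 1))

# A leader flips Bit_alpha before resetting S_alpha, so the next leader on
# the same side reads the flipped bit and obtains a different virtual ID.
old_leader = virtual_id(1, 0)
new_leader = virtual_id(1, 0 ^ 1)
```

Because the two IDs differ, a new leader on side α never uses the same slot of L as a previous leader on that side that has not yet released L.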
Sα.CAS((p, d),(⊥, ⊥))
Method promote() uses a straightforward handshaking mechanism to facilitate the promotion. [...] , waiting), and is now busy-waiting in line 27. In this case, in line 39 process p notifies process r that it can enter the critical section, and then in line 40, p waits until r writes to B_α[p][r] again; r will do so at the end of its exit section in line 34.
IV. COMPLEXITY ANALYSIS
In this section we analyze the RMR complexity of our algorithm. The proofs of deadlock freedom and mutual exclusion, as well as the proofs of some claims used for the complexity analysis are omitted due to space restrictions.
We start with some terminology and some simple facts. A subscript attached to a local variable indicates the process to which the local variable belongs; e.g., c_x denotes local variable c of process x.
We consider an execution of the algorithm, and fix linearization points of all CAS() operations in an arbitrary but unique way. When we say a process executes a CAS() operation op at time t, we mean that t is the linearization point of op. Sometimes we talk about the value that the CAS object has at a certain point; with that we mean the value it would have at that point, if all CAS() operations occurred atomically at their linearization points.
For s ∈ {0, 1} and i ≥ 0, we define a phase (s, i). Phase (s, 0) starts at the beginning of the execution, and phase (s, i), for i ≥ 1, starts when for the i-th time an S_s.CAS() operation in line 32 linearizes. Phase (s, i) ends when phase (s, i + 1) starts. The leader of phase (s, i) is the unique process p that executes a successful S_s.CAS() operation in line 6 which linearizes during phase (s, i). We say process x gets promoted during a promote() call by process y, if x enters the critical section while y's promote() call is pending. Note that y must own lock L when it calls promote(), so no two promote() calls can overlap, and thus a process can only get promoted during a single promote() call.
Theorem 1. Consider the random execution of the algorithm scheduled by a locality-aware adversary, and let τ_m be the point when the implemented lock method has been invoked m ≥ 1 times. The expected total number of RMRs incurred until point τ_m is O(m).
W.l.o.g. we assume that after point τ_m, the adversary does not schedule any new invocations of the lock() method; it only schedules processes that have already started a lock() operation until all such pending operations are completed. Let τ'_m denote that completion point. We will bound the number of RMRs incurred until point τ'_m, as this is clearly an upper bound on the RMRs incurred until τ_m. The next lemma says that it suffices to bound instead the number of writes to arrays R_s in line 5.

From Claim 3, it follows that in each iteration of the for-loop in lines 11-11, except possibly for the last one, process p adds to set found_p a distinct pair (r, d), not added to any other found set. For each such pair (r, d), we have that process r executes a different iteration of the while-loop in the lock() method. We charge to r the two RMRs incurred in the for-loop iteration in which (r, d) is added to found_p, and charge to p the two RMRs of the last for-loop iteration. It follows that for each iteration of the while-loop executed by r or p, this process is charged at most O(1) of the RMRs incurred (by any process) in lines 11-11. Claim 4 allows us to distribute the RMR cost of a promote() operation to the processes that go through the critical section during that call, charging one RMR to each process q for each of its entries to the critical section. Since only one promote() operation can be in progress at a time (as the leader must own lock L during the call), no process is charged twice for the same passage through the critical section. Finally, we observe that each release() operation incurs O(1) RMRs.
Combining the above, we obtain that the total number of RMRs is asymptotically the same as the total number of iterations of the while-loop in the lock() method, executed by all processes, plus the total number of passages of processes through the critical section. Since for each passage, a process must execute at least one iteration of the while-loop, we conclude that the total number of RMRs is asymptotically the same as the total number of iterations of the while-loop, which is equal to the number of write operations on R 0 and R 1 . This completes the proof of Lemma 2.
Next we introduce the notions of good intervals and good writes, and show that in expectation at least half of the writes to arrays R_0 and R_1 are good. Thus, it suffices to bound the number of good writes.
For s ∈ {0, 1} and i ≥ 0, the good interval I_{s,i} starts at the beginning of phase (s, i), and ends when the L.lock() operation in line 9 by the leader of phase (s, i) responds (i.e., when the leader has acquired that lock). A write operation on array R_s in line 5 is good if it takes place in some good interval I_{s,i}. Good intervals I_{s,i} and I_{s,i'} with i ≠ i' do not overlap, but good intervals for different sides s may overlap. A critical observation for our analysis is that the union of all good intervals covers the complete execution. Since each process chooses the side α at random before writing to R_α, with probability at least 1/2 that write will occur during a good interval on side α. A straightforward application of Wald's Theorem yields the following lemma.
Lemma 5. In expectation, at least half of all write operations on arrays R 0 and R 1 are good.
Next we look at a single phase (s, i), and bound from below the number of times processes go through the critical section during that phase, in terms of the number of good writes to R_s in the phase. Let k_{s,i} be the number of good writes to array R_s in phase (s, i), and let ℓ_{s,i} be the number of passages through the critical section by processes in phase (s, i); if phase (s, i) does not exist, then k_{s,i} = ℓ_{s,i} = 0.
Proof:
We first give an overview of the proof. We fix the set of processes that perform a good write to array R s during phase (s, i), and condition on the event E that all the first κ = log(k s,i ) positions in R s get written by those good writes. Event E has constant probability, and implies that the leader of phase (s, i) will execute at least κ iterations of the for-loop comprising lines 11-11; and for each 1 ≤ j ≤ κ, it will add to its found set some process that wrote to R s [j] after the beginning of the phase. The leader will then have to wait in line 16 until all these processes have executed line 24, and in particular until the process p * that wrote to R s [κ] has done so. Given E, the conditional distribution of the position λ in which a process writes to R s in phase (s, i) is very close to the unconditional one, described in line 4. In particular, the probability that a process writes to the κ-
It follows that any schedule by the adversary will result in an expected number of at least Ω(k s,i ) processes that write to R s in phase (s, i), and execute line 24 before process p * does. All these processes will be promoted to the critical section before the end of the phase, as the leader has to wait for p * in line 16 before it invokes a promote() operation in line 17, and thus it will see those Ω(k s,i ) processes (in expectation) when it scans through its backpack.
We now give the detailed proof. We define three sets of processes, K s,i , M s,i , and P s,i , as follows. Set K s,i consists of all processes that perform a good write to R s during phase (s, i). Set M s,i consists of the processes that write to R s between the beginning of phase (s, i), and the point right after the leader of the phase has either completed the κ-th iteration of the for-loop in lines 11-11, where κ := log(k s,i ) , or has broken out of the loop in line 13 (whichever of the two happens first). Note that K s,i ⊆ M s,i . Finally, set P s,i contains all processes p ∈ M s,i that write to array B s in line 24 before the leader invokes the promote() operation in line 17.
We let E be the event that for each position 1 ≤ j ≤ κ of array R s , at least one of the k s,i good write operations on R s during phase (s, i) is performed on register R s [j] .
The following claim establishes that no process, which writes to R s in phase (s, i), can proceed past line 27, or, unless it is the leader of that phase, read an entry of R s , before the leader of that phase executes line 17. Between the beginning of phase (s, i) and the point when good interval I s,i ends, the leader does not access R s . Claim 7 implies that the position λ in R s , where a process p ∈ K s,i writes to when it executes line 5 during I s,i does not affect any other processes' steps, or p's steps starting with line 5 and until interval I s,i ends. Since the adversary does not know which position λ a process p chooses, and only the location of p's write to R s depends on λ, that random choice does not affect the schedule up to the point interval I s,i ends.
Fix some execution prefix E that ends at the beginning of good interval I_{s,i}. In addition, fix all remaining random choices made by processes until the end of I_{s,i}, except for the choice of λ that each process makes on side s. Then for every infinite sequence λ = (λ_1, λ_2, . . . ), the adversary schedules an execution that is uniquely determined by λ up to the end of I_{s,i}, where the j-th process that executes line 4 on side s during I_{s,i} chooses the value λ_j in that line. Let E_λ denote that unique execution up to the point when I_{s,i} ends. Then we have that for any two sequences λ, λ', the adversary cannot distinguish between E_λ and E_λ', and the only difference any process sees (if any) is the random value it chooses in line 4 on side s during I_{s,i}. Since the adversary cannot distinguish between those two executions, K_{s,i} is the same for both, and so is the order of all steps. Hence, κ is also fixed, and the probability of event E, that each of the values 1, . . . , κ is chosen at least once by processes in K_{s,i}, is [...]

Let κ' ≥ κ be the number of iterations of the for-loop in lines 11-11 completed by the leader of phase (s, i), without including the last iteration if it ends with a break. Let T be the point right after the end of the last iteration of the for-loop.
In the following we condition on event E and also on the value of κ′. Note that E implies κ′ ≥ κ, by Claim 8.
We are interested in the conditional distribution of the λ-value of each process p ∈ M_{s,i}, given E and κ′ = k, for k ≥ κ. At point T the adversary has no knowledge of those λ-values, except for what can be inferred from the value of κ′. Hence, conditionally on events E and κ′ = k, the value of λ chosen by each p ∈ M_{s,i} has no effect on the schedule and thus on p's steps during the interval starting with p's good write to R_s and ending at point T. Then, for each p ∈ M_{s,i} and 1 ≤ j ≤ κ, the probability that p chooses λ = j is
and
The next claim says that the leader will wait in line 16 until all processes in its found set (other than the leader itself) have executed line 24. We are interested in the total number of processes p ∈ M_{s,i} that execute line 24 until all processes in the leader's found set (other than the leader) have executed that line. This is lower bounded by the number Y of processes p ∈ M_{s,i} that execute line 24 until the first process p* ∈ M_{s,i} with λ_{p*} = κ executes that line, provided that the λ-value of the leader is not κ.
After point T, the leader executes the loop in lines 15-16, throughout which all shared memory steps are reads of registers in the leader's own local memory segment and thus do not incur RMRs. Hence, the adversary does not gain any information about how many iterations of the for-loop have been executed by the leader until the leader finishes that loop. Therefore, to determine Y we can assume that the λ-value of each process p ∈ M_{s,i} (other than the leader) remains unknown until the leader has finished its loop in lines 15-16. Given E and κ′, the probability that any given process p ∈ M_{s,i} (including the leader) chooses λ = κ is at most π = (3/2^κ)/(1 − 1/2^κ), by (1). From the union bound, the probability that neither the leader nor any of the first j − 1 processes p ∈ M_{s,i} that execute line 24 chooses λ = κ is at least 1 − jπ.
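As a numerical illustration of the union bound used here, the following minimal sketch evaluates π and the lower bound 1 − jπ with exact rational arithmetic. It assumes the expression for π reads (3/2^κ)/(1 − 1/2^κ); the function names `pi_bound` and `union_lower_bound` are hypothetical and not part of the algorithm.

```python
from fractions import Fraction


def pi_bound(kappa: int) -> Fraction:
    """Upper bound pi = (3 / 2^kappa) / (1 - 1 / 2^kappa).

    Assumption: this is how the expression for pi in the text is read.
    It simplifies to 3 / (2^kappa - 1).
    """
    if kappa < 1:
        raise ValueError("kappa must be a positive integer")
    return Fraction(3, 2**kappa) / (1 - Fraction(1, 2**kappa))


def union_lower_bound(j: int, kappa: int) -> Fraction:
    """Union-bound lower bound 1 - j * pi on the probability that none of
    j processes (the leader plus the first j - 1 others) chooses lambda = kappa.
    """
    if j < 0:
        raise ValueError("j must be non-negative")
    return 1 - j * pi_bound(kappa)


if __name__ == "__main__":
    for kappa in (3, 4, 8):
        for j in (1, 2, 4):
            print(kappa, j, pi_bound(kappa), union_lower_bound(j, kappa))
```

The bound is informative only while jπ < 1; for larger j the expression 1 − jπ becomes negative and the statement is trivially true.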
E[|P_{s,i}|] ≥ E[|P_{s,i}| | E] · Pr(E)    (1)
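This inequality is the law of total expectation with a nonnegative term dropped. A short derivation, writing \bar{E} for the complement of event E (notation introduced here only for illustration) and assuming 0 < Pr(E) < 1 so that both conditional expectations are defined:

```latex
\begin{align*}
\mathbb{E}\bigl[|P_{s,i}|\bigr]
  &= \mathbb{E}\bigl[|P_{s,i}| \mid E\bigr]\Pr(E)
   + \mathbb{E}\bigl[|P_{s,i}| \mid \bar{E}\bigr]\Pr(\bar{E}) \\
  &\ge \mathbb{E}\bigl[|P_{s,i}| \mid E\bigr]\Pr(E),
\end{align*}
```

where the last step uses |P_{s,i}| ≥ 0, so the second summand is nonnegative.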
The next claim says that the leader and all p ∈ P_{s,i} go through the critical section in phase (s, i).
Since the leader does not belong to P_{s,i}, it follows from Claim 10 and (3) that the expected total number of processes that go through the critical section in phase (s, i) is at least
We now have all the pieces we need to prove the main result of the section, the bound on the total number of RMRs.
Proof of Theorem 1: Recall that we have assumed w.l.o.g. that the adversary schedules exactly m invocations of the implemented lock() operation. We will bound the total number of RMRs until the point τ_m when all those operations are completed. The total number of passages by processes through the critical section is also m. Thus, m is the sum, over all s ∈ {0, 1} and i ≥ 0, of the number of passages through the critical section in phase (s, i) (this number is 0 if phase (s, i) does not exist).
V. CONCLUSION
For the CC model, there is currently no randomized algorithm that achieves constant RMR complexity. We believe that our techniques can be extended to the CC model. To achieve this, we are working on a mechanism that allows processes to join the backpack of a leader in a similar way as in our DSM algorithm. The naive algorithm requires the leader to scan an array of size n and incurs Ω(n) RMRs in the CC model, but we have a randomized technique that achieves something similar in O(1) RMRs. However, it is a technical challenge to combine this with the "oracle" mechanism of our DSM algorithm.
