A Solution to k-Exclusion with O(logk) RMR Complexity by Choi, Jonathan H
Dartmouth College 
Dartmouth Digital Commons 
Dartmouth College Undergraduate Theses Theses and Dissertations 
6-3-2011 
A Solution to k-Exclusion with O(logk) RMR Complexity 
Jonathan H. Choi 
Dartmouth College 
Recommended Citation 
Choi, Jonathan H., "A Solution to k-Exclusion with O(logk) RMR Complexity" (2011). Dartmouth College 
Undergraduate Theses. 68. 
https://digitalcommons.dartmouth.edu/senior_theses/68 
A Solution to k-Exclusion with O(log k) RMR Complexity
Dartmouth Computer Science Technical Report TR2011-682
Jonathan Choi
Thesis Advisor: Prasad Jayanti
June 3, 2011
Abstract
We specify and prove an algorithm solving k-Exclusion, a generalization of the Mutual
Exclusion problem. k-Exclusion requires that at most k processes be in the Critical Section
(CS) at once; in addition, we require bounded exit, starvation freedom and fairness properties.
The goal within this framework is to minimize the number of Remote Memory References
(RMRs) made. Previous algorithms have required Ω(k) RMRs in the worst case. Our algorithm
requires O(log k) RMRs in the worst case under the Cache-Coherent (CC) model, a considerable
improvement in time complexity.
1 Introduction
The k-Exclusion problem, first introduced by Fischer et al. [1], is a generalization of the well-known
Mutual Exclusion problem. Mutual Exclusion requires that at most one process be in the Critical
Section (CS) of code at any point, whereas k-Exclusion requires that at most k processes be in the
CS at any point for k ≥ 1. We seek to minimize the algorithm’s worst-case time complexity, as
measured in Remote Memory References (RMRs). Previous work has produced algorithms with
worst-case complexity of O(n) and O(k) RMRs, depending on the memory model used. This paper
specifies and proves an algorithm with O(log k) worst-case RMR complexity in the Cache Coherent
(CC) model.
1.1 Model
1.1.1 Basics
The basic requirement of any exclusion algorithm is to limit the number of processes in a certain
section of code, designated the Critical Section (CS). The contents of the CS are irrelevant to
k-Exclusion; however, we could imagine the CS code accessing a scarce resource. In addition,
we conventionally designate a Remainder Section, representing the code in which the k-Exclusion
routine is couched. Its contents are also irrelevant. Each process may enter from and exit to
the remainder section multiple times, with no guarantees about timing. Finally, we designate the
section of code in our algorithm before the CS the Try Section and designate the section of code
after the CS the Exit Section.
The theoretical framework for a collection of processes is that we think of each state as a
configuration, denoted C. Each process has a transition function mapping C to a set of configurations
conceptually reachable with a single step (configurations one step away from C are denoted C.s).
An execution is a sequence of configurations that can be obtained by repeatedly applying
transition functions, starting from the initial configuration. The correctness and complexity of our
algorithm must hold over all such executions.
Finally, specific processes may be either enabled or crashed. A process is enabled in a configuration
if and only if it will enter the CS in a finite number of its own steps, regardless of the steps
taken by other processes. A process is crashed if, in an infinite execution, it takes no more steps.
1.1.2 Memory Models and Time Complexity
There are two dominant memory paradigms in the field of distributed computing, known as the
Cache Coherent (CC) and Distributed Shared Memory (DSM) models. In the CC model, each
process has a number of local variables that are costless to access, but accessible only by that
specific process; there are also a number of shared (or global) variables that are costly to access,
but once accessed, can be cached and waited upon at no cost until the global variable is modified.
Moreover, the CC model allows each process to spin on multiple variables, paying an upfront cost
for a read and subsequently only paying when each variable is recached. (This fact is crucial in the
algorithm).
The DSM model treats local variables identically, but associates global variables with specific
processes as well. Under DSM, access to a global variable is free for the process associated with it,
and costly for every other process. That cost must be paid each time access is made, in contrast
to the CC model. Intuitively, processes in the CC model access a shared pool of global variables,
while processes in the DSM model each have global memory modules which can be accessed by
their peers. Each model has advantages and disadvantages; the two paradigms are known not to
be equivalent in power. DSM is typically thought of as more versatile, but CC allows strictly more
efficient solutions to k-Exclusion.
Closely associated with the two paradigms is the idea of cost and time complexity. The time
complexity of an operation is defined in terms of Remote Memory References (RMRs). RMRs occur
whenever a CC process recaches a global variable, or modifies a global variable; likewise, RMRs
occur whenever a DSM process accesses or modifies a shared variable on another process. RMR
costs are incurred by the particular process doing the reading or writing. Thus we can consider
both per-process and amortized RMR complexity.
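The difference between the two cost models can be illustrated with a toy RMR counter. The Python below is only a sketch under simplifying assumptions: the class names are hypothetical, a writer is assumed to keep a valid cached copy after its own write, and a DSM variable is assumed to be free for its owner and to cost one RMR per access for everyone else.

```python
class CCVariable:
    """Shared variable in a toy Cache-Coherent (CC) cost model."""

    def __init__(self, value=0):
        self.value = value
        self.cached_by = set()   # processes holding a valid cached copy
        self.rmrs = {}           # process id -> RMRs charged so far

    def _charge(self, pid):
        self.rmrs[pid] = self.rmrs.get(pid, 0) + 1

    def read(self, pid):
        # A read is remote only when the reader has no valid cached copy.
        if pid not in self.cached_by:
            self._charge(pid)
            self.cached_by.add(pid)
        return self.value

    def write(self, pid, value):
        # Every write is remote and invalidates all other cached copies.
        self._charge(pid)
        self.value = value
        self.cached_by = {pid}


class DSMVariable:
    """Shared variable in a toy DSM cost model: free only for its owner."""

    def __init__(self, owner, value=0):
        self.owner = owner
        self.value = value
        self.rmrs = {}

    def _charge(self, pid):
        if pid != self.owner:    # non-owners pay on every single access
            self.rmrs[pid] = self.rmrs.get(pid, 0) + 1

    def read(self, pid):
        self._charge(pid)
        return self.value

    def write(self, pid, value):
        self._charge(pid)
        self.value = value


cc, dsm = CCVariable(), DSMVariable(owner=1)
for _ in range(1000):            # process 0 spins on both variables
    cc.read(0)
    dsm.read(0)
cc.write(1, 5)                   # process 1 modifies both variables
dsm.write(1, 5)
cc.read(0)                       # process 0 recaches under CC
dsm.read(0)
print(cc.rmrs, dsm.rmrs)         # {0: 2, 1: 1} {0: 1001}
```

Under these assumptions, spinning costs the CC process one RMR up front plus one per modification, while the non-owner DSM process pays for every poll.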
1.1.3 Shared Objects
Finally, the types of allowable shared variables are important to the algorithm. We make reference
to three kinds: read/write objects, Fetch and Increment (F&I) objects, and Compare and Swap
(CAS) objects. The objects are distinguished by the operations that they support atomically. An
atomic operation is an operation that takes a single step in any execution.
• Read/Write: supports read and write. read(X) returns the value stored in X; write(X)
atomically overwrites the value currently in X.
• F&I: supports read and F&I. read(X) is as above; F&I(X) first returns the value in X, then
increments it. The two operations occur atomically.
• CAS: supports read and CAS. read(X) is still as above; CAS(X, a, b) atomically sets X ← b
if X = a, and does nothing if X ≠ a. It returns true if the swap occurs, false otherwise.
Our algorithm uses only F&I and CAS objects; read(X) is often abbreviated by referring to
X directly. For example, line 7 of the algorithm reads a ← A[i][j], meaning that A[i][j] is read,
and the result is stored in local variable a.
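The semantics of the two object types used by the algorithm can be sketched in Python. This is an illustration only: the class and method names are hypothetical, and a lock stands in for the hardware atomicity that the model assumes.

```python
import threading


class FetchAndIncrement:
    """F&I object: read, plus an atomic fetch-then-increment."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def read(self):
        with self._lock:
            return self._value

    def fetch_and_increment(self):
        with self._lock:
            old = self._value           # return the value held before
            self._value += 1            # the increment
            return old


class CompareAndSwap:
    """CAS object: read, plus an atomic compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False            # no change when the comparison fails
            self._value = new
            return True


entry = FetchAndIncrement()
tokens = [entry.fetch_and_increment() for _ in range(3)]
cell = CompareAndSwap(1)
first = cell.cas(1, 2)    # succeeds: the cell held 1
second = cell.cas(1, 3)   # fails: the cell now holds 2
print(tokens, first, second, cell.read())   # [0, 1, 2] True False 2
```

Successive F&I calls hand out distinct, increasing values, which is the property the algorithm relies on when it issues tokens.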
The RMR complexity of each of these operations is constant for an initial read and constant
for write, F&I and CAS operations (regardless of the success of the CAS). A process spinning on
X will also incur an RMR every time X is modified. However, there are somewhat tricky border
cases. In particular, we need to define whether a failed CAS(X, a, b) causes recaching such that
any process spinning on X incurs RMRs.
1.1.4 “Smart” Cache
Whether failed CAS causes RMRs on spinning processes or not is a hardware question out of
the scope of this paper. For conservativeness, we will consider both the case where it does not
(“Smart” Cache) and the case where it does. With Smart Cache, our algorithm has O(log k) worst-
case per-process RMR complexity; without Smart Cache, it has O(log k) amortized per-process
RMR complexity.
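The effect of the Smart Cache assumption on a spinning process can be seen in a small cost model. The sketch below is hypothetical (the class `CachedCasCell` and the helper `spinner_rmrs` are names introduced here); it assumes that a successful CAS always invalidates cached copies and that a failed CAS does so only when Smart Cache is absent.

```python
class CachedCasCell:
    """CAS cell that counts RMRs under a toy CC model."""

    def __init__(self, value, smart_cache):
        self.value = value
        self.smart_cache = smart_cache
        self.cached_by = set()
        self.rmrs = {}

    def _charge(self, pid):
        self.rmrs[pid] = self.rmrs.get(pid, 0) + 1

    def read(self, pid):
        if pid not in self.cached_by:   # recache only when invalidated
            self._charge(pid)
            self.cached_by.add(pid)
        return self.value

    def cas(self, pid, expected, new):
        self._charge(pid)               # the CAS itself always costs one RMR
        success = self.value == expected
        if success:
            self.value = new
        if success or not self.smart_cache:
            self.cached_by = set()      # spinners must recache
        return success


def spinner_rmrs(smart_cache, failed_attempts):
    """RMRs charged to a spinner while another process fails CAS repeatedly."""
    cell = CachedCasCell(5, smart_cache)
    cell.read("spinner")                # initial read: one RMR
    for _ in range(failed_attempts):
        cell.cas("exiter", 0, 9)        # fails: the cell holds 5, not 0
        cell.read("spinner")            # free unless the copy was invalidated
    return cell.rmrs["spinner"]


print(spinner_rmrs(True, 10), spinner_rmrs(False, 10))   # 1 11
```

Without Smart Cache, each failed CAS by another process charges the spinner one extra RMR, which is why the bound in that case is stated as amortized rather than worst-case.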
1.2 k-Exclusion
1.2.1 Intuition
The basic problem of k-Exclusion is to limit the number of processes in the critical section to at
most k processes. However, this is not the only desirable property. We would like guaranteed
progress under ordinary circumstances (clearly, though, if k processes have crashed in the CS, no
further progress can occur). We would like to ensure fairness - that is, we would like to ensure that
processes that have entered the try section earlier also enter the CS earlier. (We formalize this
property with the doorway, a bounded fragment of code at the start of the algorithm). Finally, we
would like to guarantee that processes past the CS will exit in a bounded number of steps.
1.2.2 Desired Properties
• k-Exclusion: At most k processes are in the CS at any time.
• Starvation Freedom: If fewer than k processes crash outside the remainder section, a
non-crashing process in the try section eventually enters the CS.
• First-In First-Enabled: If a process p enters the doorway before another process p′, and
p′ is in the CS, then either p already entered the CS or p is enabled to enter the CS.
• Bounded Exit: All processes complete their Exit Sections in a bounded number of steps,
regardless of speeds or interleavings of the other processes.
• O(log k) RMR Complexity: Every process makes O(log k) remote memory references per
iteration of the algorithm.
1.2.3 Previous Research
Several solutions to k-Exclusion already exist in the literature, both in the cache-coherent (CC)
and distributed shared memory (DSM) models. Anderson and Moir [2] specify algorithms that
satisfy k-Exclusion in either Θ(k log(n/k)) or Θ(c) RMRs, where c is point contention. To do so,
they require Read/Write, Fetch & Add (F&A) and Compare and Swap (CAS) objects. Danek [3]
specifies and proves an algorithm with Θ(n) RMR complexity, but which requires only Read/Write
objects.
The previously known algorithm with the lowest worst-case per-process run time is by Decker
[4]. This algorithm requires Smart Cache and achieves O(k) worst-case RMRs. (It is not known
what the amortized cost of this algorithm would be in the absence of Smart Cache; its worst-
case per-process RMR complexity would be Ω(n)). The algorithm in this thesis represents an
improvement in time complexity over Decker’s, without an increase in space complexity.
Interestingly, Danek and Hadzilacos [5] provide a proof that no algorithm in the DSM model
can improve upon an Ω(n) lower bound for RMR complexity. This demonstrates that for k-Exclusion,
the CC model is significantly more powerful than the DSM model, although this is not necessarily
true in general.
2 Algorithm
2.1 Intuition
Those new to k-Exclusion often suggest that a queue could be used to keep track of waiting processes
(unenabled processes in the try section). Perhaps exiting processes could dequeue and enable before
exiting, which would result in constant time complexity. This naive solution fails because a process
could dequeue a waiting process and crash, thus failing to enable the waiting process and causing
starvation. Starvation can be avoided if each exiting process steps through the queue, and only
removes waiting processes from it once certain that they are enabled. However, a malign execution
could result in an exiting process being continually pre-empted by other processes. The pre-empted
process would never successfully enable a waiting process, and would therefore never exit.
This difficulty can be avoided if we bound the number of enabling attempts each exiting process
makes. This is the idea behind Decker’s algorithm, which bounds the number of attempts at O(k).
Intuitively, Decker observed that so long as each exiting process enables a corresponding k processes
in order, all required properties are satisfied.
Our algorithm takes this solution one step further. Observe that if x processes have exited,
then the first x + k processes can safely be allowed to enter. We first match each exiting process
with the latest trying process that can safely be enabled. Each exiting process then enables its
partner, as well as the k trying processes preceding its partner.
Enabling the processes naively would still require O(k) RMRs per exiting process. We make
two novel improvements that allow us to reduce this to O(log k). First, we observe that under the
CC model, a process is allowed to spin on more than one shared variable. Exiting processes can
therefore modify certain shared variables in a pattern that enables the required k waiting processes,
without performing k operations.
The second novelty is the scheme we use to match waiting and exiting processes together. We
implement a two-dimensional array A with N slots, divided into N/k blocks of size k each. N
is the smallest multiple of k that is greater than or equal to n.
Figure 1: A, for k = 5, N = 15. (2 · 15) + 8 = 38 processes have been enabled.
Entering processes receive successive tokens. To satisfy FIFE, every process with a lower token
must be enabled earlier or simultaneously. Based on this token, each waiting process calculates its
location in A and a value for which to wait. We define this calculation so that new processes are
conceptually introduced from left to right; when wraparound occurs, the value waited for increases
by 1. In figure 1, for example, the processes with the first 38 tokens are enabled.
However, as mentioned above, trying processes spin on more than one location in A. Once a
waiting process calculates the location in A corresponding to its token - composed of a block index
and a within-block index, the latter of which we will call w - it generates Wait-Set(w), and waits
on each number in Wait-Set(w) within its assigned block. Similarly, each exiting process calculates
a Release-Set(r) (based on its partner’s token).
Where binary length = ⌈log₂ k⌉ − 1, and BP(i) is the bit-pattern of i,
Wait-Set(w) = {w ≤ b ≤ k − 1 | BP(b) = prefix(BP(w)) · 10*} ∪ {k − 1}
Release-Set(r) = prefix(BP(r)) · 0*
For any w ≤ r, we require that |Wait-Set(w) ∩ Release-Set(r)| ≥ 1. Moreover, we require that
∀w ≤ w′ ≤ r : min(Wait-Set(w) ∩ Release-Set(r)) ≤ min(Wait-Set(w′) ∩ Release-Set(r)). Both
properties are proved formally below; the intuition is that the intersection point is determined by
the leftmost bit at which the waiting index differs from r, and that bit is at least as significant for
w as it is for w′.
Figure 2: Wait-Set(w) and Release-Set(r), where k = 33. The element at which they intersect is
determined by the leftmost bit at which they differ.
Using the same w and r, here is a diagram of Wait-Set(w) and Release-Set(r). Note that all
elements in Wait-Set(w) are ≥ w, and all elements in Release-Set(r) are ≤ r.
Figure 3: One block, where k = 33. The shaded squares indicate Wait-Set(w) and the arrows
indicate Release-Set(r), where w = 11 and r = 19. The indices of A are numbered.
Each exiting process is guaranteed to enable its partner and every previous waiting process in
the same block. If in addition to these, it updates the (k − 1)th entry in the previous block, it will
have updated its partner and at least k processes prior to its partner, in the correct order.
Thus, our algorithm does the same work as prior algorithms, considerably more efficiently.
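The two set definitions can be made concrete with a short sketch. The Python below is an illustration rather than the thesis's own code: the function names, the choice of pattern length (just enough bits to write k − 1), and the explicit inclusion of w itself in Wait-Set(w), which the proof of Property 4 relies on, are assumptions of this sketch.

```python
def wait_set(w, k):
    """Indices that a waiting process with within-block index w spins on."""
    bits = max(1, (k - 1).bit_length())     # assumed pattern length
    result = {w, k - 1}                     # w itself and the last slot
    for s in range(bits):
        # Keep the bits of w above position s, set bit s, clear the rest:
        # this is prefix(BP(w)) followed by 1 and then zeros.
        b = ((w >> (s + 1)) << (s + 1)) | (1 << s)
        if w <= b <= k - 1:
            result.add(b)
    return sorted(result)


def release_set(r):
    """Indices an exiting process updates: prefixes of r padded with zeros."""
    return sorted({(r >> s) << s for s in range(r.bit_length() + 1)})


# The values used in Figure 3: k = 33, w = 11, r = 19.
waits = wait_set(11, 33)
releases = release_set(19)
print(waits)                          # [11, 12, 16, 32]
print(releases)                       # [0, 16, 18, 19]
print(set(waits) & set(releases))     # {16}
```

For these values the two sets meet at 16, the index obtained by keeping the bits of r down to the leftmost position where w and r differ and clearing everything below it.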
2.2 Variables
N = min({m ≥ n | m mod k = 0})
Shared Variables:
    Entry = 0
    Exit = 0
    A[0][0..k − 1] = 1
    A[1..(N/k) − 1][0..k − 1] = 0
Local Variables:
    t = e = i = j = w = −1
    j′ = r = u = −1
    t′ = k − 1
2.3 Routine
Main Routine: k-Exclusion
    Remainder Section
1   t ← F&I(Entry); [e, i, j] ← parse(t)
2   wait till ∃w ∈ Wait-Set(j) : A[i][w] ≥ e
    Critical Section
3   t′ ← F&I(Exit) + k; j′ ← t′ mod k
4   foreach ascending r ∈ Release-Set(j′) ∪ {−1} do
5       update(t′ − j′ + r)
6       update(t′ − j′ + r)
Subroutine: update(u)
    [e, i, j] ← parse(u)
7   a ← A[i][j]
8   if (Exit ≤ u + 2k) and (a < e) then
9       CAS(A[i][j], a, e)
Subroutine: parse(t)
    [e, f] ← [⌊t/N⌋ + 1, t mod N]
    [i, j] ← [⌊f/k⌋, t mod k]
    return [e, i, j]
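To see the routine in motion, the following Python sketch replays it sequentially, with one process finishing each step before the next one runs. It is a single-threaded illustration, not a concurrent implementation: the class `KExclusion`, the driver `simulate`, the pattern length used in `wait_set`, and the inclusion of w itself in Wait-Set(w) are all assumptions of this sketch, and the CAS on line 9 is emulated by a plain comparison.

```python
def wait_set(w, k):
    bits = max(1, (k - 1).bit_length())        # assumed pattern length
    result = {w, k - 1}                        # assumes w is in Wait-Set(w)
    for s in range(bits):
        b = ((w >> (s + 1)) << (s + 1)) | (1 << s)
        if w <= b <= k - 1:
            result.add(b)
    return result


def release_set(r):
    return sorted({(r >> s) << s for s in range(r.bit_length() + 1)})


class KExclusion:
    def __init__(self, n, k):
        self.k = k
        self.N = -(-n // k) * k                # smallest multiple of k >= n
        self.entry = 0
        self.exit = 0
        # Block 0 starts at 1, so the first k tokens are enabled immediately.
        self.A = [[1 if i == 0 else 0] * k for i in range(self.N // k)]

    def parse(self, t):
        e, f = t // self.N + 1, t % self.N
        return e, f // self.k, t % self.k

    def enabled(self, t):                      # the condition awaited on line 2
        e, i, j = self.parse(t)
        return any(self.A[i][w] >= e for w in wait_set(j, self.k))

    def update(self, u):                       # lines 7-9
        e, i, j = self.parse(u)
        a = self.A[i][j]
        if self.exit <= u + 2 * self.k and a < e:
            if self.A[i][j] == a:              # emulates CAS(A[i][j], a, e)
                self.A[i][j] = e

    def take_token(self):                      # line 1: F&I(Entry)
        t, self.entry = self.entry, self.entry + 1
        return t

    def exit_section(self):                    # lines 3-6
        t_prime, self.exit = self.exit + self.k, self.exit + 1
        j_prime = t_prime % self.k
        for r in [-1] + release_set(j_prime):  # ascending, -1 first
            self.update(t_prime - j_prime + r)     # line 5
            self.update(t_prime - j_prime + r)     # line 6


def simulate(n, k, rounds):
    """Return (order in which tokens entered the CS, peak CS occupancy)."""
    sim = KExclusion(n, k)
    waiting, in_cs, admitted, peak = [], [], [], 0
    for _ in range(rounds):
        while len(waiting) + len(in_cs) < n:   # idle processes re-enter
            waiting.append(sim.take_token())
        ready = [t for t in waiting if sim.enabled(t)]
        waiting = [t for t in waiting if t not in ready]
        in_cs += ready
        admitted += ready
        peak = max(peak, len(in_cs))
        in_cs.pop(0)                           # the oldest occupant exits
        sim.exit_section()
    return admitted, peak


order, peak = simulate(n=7, k=3, rounds=50)
print(peak, order[:8])                         # 3 [0, 1, 2, 3, 4, 5, 6, 7]
```

In this sequential schedule at most k tokens occupy the CS at once and tokens are admitted in the order they were issued. A sequential replay says nothing about interleaved executions, which is what the invariants in Section 3 address.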
2.4 Definitions of Wait-Set and Release-Set
Where binary length = ⌈log₂ k⌉ − 1,
Wait-Set(j) = {j ≤ b ≤ k − 1 | BP(b) = prefix(BP(j)) · 10*} ∪ {k − 1}
Release-Set(j) = prefix(BP(j)) · 0*
3 Proof
3.1 Wait-Set and Release-Set Properties
∀h, i, j ∈ 0..k − 1 :
(P1) Wait-Set(i), Release-Set(i) ⊆ {0..k − 1}
(P2) ∀w ∈Wait-Set(i), w ≥ i
(P3) ∀r ∈ Release-Set(i), r ≤ i
(P4) i ≤ j ⇒ Wait-Set(i) ∩ Release-Set(j) ≠ ∅
(P5) h ≤ i ≤ j ⇒ min(Wait-Set(h) ∩ Release-Set(j)) ≤ min(Wait-Set(i) ∩ Release-Set(j))
(P6) |Wait-Set(i)| = O(log k) and |Release-Set(i)| = O(log k)
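Before turning to the proofs, the properties can be checked exhaustively for small k. The sketch below is a sanity check and not a substitute for the proofs; it assumes one particular reading of the definitions (a pattern length just large enough to write k − 1, and w itself counted as a member of Wait-Set(w)), and the function names are hypothetical. For (P6) it checks the concrete bound of pattern length plus two.

```python
def wait_set(w, k):
    bits = max(1, (k - 1).bit_length())      # assumed pattern length
    result = {w, k - 1}                      # assumes w is in Wait-Set(w)
    for s in range(bits):
        b = ((w >> (s + 1)) << (s + 1)) | (1 << s)
        if w <= b <= k - 1:
            result.add(b)
    return result


def release_set(r):
    return {(r >> s) << s for s in range(r.bit_length() + 1)}


def check_properties(k):
    """Raise AssertionError if (P1)-(P6) fail for this k; else return True."""
    universe = set(range(k))
    bound = max(1, (k - 1).bit_length()) + 2
    for j in range(k):
        R = release_set(j)
        assert R <= universe                 # (P1)
        assert max(R) <= j                   # (P3)
        assert len(R) <= bound               # (P6)
        previous_min = -1
        for i in range(j + 1):
            W = wait_set(i, k)
            assert W <= universe             # (P1)
            assert min(W) >= i               # (P2)
            assert len(W) <= bound           # (P6)
            common = W & R
            assert common                    # (P4): the sets intersect
            assert previous_min <= min(common)   # (P5): minima are monotone
            previous_min = min(common)
    return True


print(all(check_properties(k) for k in range(1, 65)))   # True
```

Checking that the minimum intersection never decreases as i grows, for each fixed j, is equivalent to (P5) by transitivity.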
3.2 Proof of Set Properties
3.2.1 Property 1
Proof Wait-Set and Release-Set clearly satisfy (P1). By definition, ∀w ∈ Wait-Set(i) : 0 ≤ i ≤
w ≤ k − 1. Similarly, ∀r ∈ Release-Set(i) : 0 ≤ r ≤ i ≤ k − 1. So Wait-Set(i), Release-Set(i)
⊆ {0..k − 1}.
3.2.2 Properties 2 and 3
Proof The restriction that ∀w ∈Wait-Set(i), w ≥ i is built into the definition of Wait-Set (noting
that i ≤ k − 1). The restriction that ∀r ∈ Release-Set(i), r ≤ i follows from the fact that a prefix
of i is being concatenated with 0s - thus any element in Release-Set(i) is ≤ i.
3.2.3 Lemma 1
Lemma 1 ∀i, j ∈ 0..k − 1, i < j : |Wait-Set(i) ∩ Release-Set(j)| = 1
Proof
Some notation:
i_1..i_⌈log k⌉ = BP(i)
j_1..j_⌈log k⌉ = BP(j)
Given i < j, ∃x ∈ 1..⌈log k⌉ : i_x ≠ j_x. Consider the smallest such x (corresponding to the most
significant differing bit in BP(i) and BP(j)), which we will refer to as y. If i_y = 1 and j_y = 0,
then i > j; this is contradictory, so i_y = 0 and j_y = 1.
By the definitions of Wait-Set and Release-Set,
i_1..i_{y−1}10^{k−y−1} ∈ Wait-Set(i)
and
j_1..j_y0^{k−y−1} ∈ Release-Set(j)
Because y is the smallest x such that i_x ≠ j_x,
i_1 i_2...i_{y−1} = j_1 j_2...j_{y−1}
Because j_y = 1,
j_y0^{k−y−1} = 10^{k−y−1}
Thus for s = i_1..i_{y−1}10^{k−y−1} = j_1..j_y0^{k−y−1}, s ∈ Wait-Set(i) and s ∈ Release-Set(j).
Significantly for the proof of property 5, s represents a unique member of Wait-Set(i) and
Release-Set(j) that is in both sets. To see this, consider bit positions other than y. ∀x < y, i_x = j_x.
Consequently, i_1..i_{x−1}10^{k−x−1} > j_1..j_x0^{k−x−1}. ∀x > y, i_1..i_{x−1}10^{k−x−1} < j_1..j_x0^{k−x−1},
since the two will differ at bit y (which is of greater significance than all subsequent bits). The only
possible element of Wait-Set that does not fit this definition is k − 1. However, Release-Set(j) contains
k − 1 iff j = k − 1 by (P3). So in all cases, s is unique. ∴ |Wait-Set(i) ∩ Release-Set(j)| = 1.
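As a concrete instance of the construction in Lemma 1, take the values from Figure 3 (k = 33, i = 11, j = 19). The worked example below assumes 6-bit patterns, enough to write every index in 0..32, and writes ℓ for that pattern length; ℓ is notation introduced here for illustration and does not appear in the proof.

```latex
\begin{align*}
BP(11) &= 0\,0\,1\,0\,1\,1, \qquad BP(19) = 0\,1\,0\,0\,1\,1, \qquad \ell = 6 \\
y &= 2 \quad \text{since } i_1 = j_1 = 0,\ i_2 = 0,\ j_2 = 1 \\
s &= i_1 \cdot 1 \cdot 0^{\ell - y} = 0\,1\,0\,0\,0\,0 = 16 \\
  &= j_1 j_2 \cdot 0^{\ell - y} = 0\,1\,0\,0\,0\,0 = 16
\end{align*}
```

Under this reading, 16 belongs to both Wait-Set(11) and Release-Set(19), and it is fixed by the leftmost bit position at which the two patterns differ.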
3.2.4 Property 4
Proof Assume i ≤ j. To prove: Wait-Set(i) ∩ Release-Set(j) ≠ ∅.
We will consider two cases exhausting all possibilities: either i < j, or i = j. If i < j, then by
Lemma 1, |Wait-Set(i) ∩ Release-Set(j)| = 1 ≠ 0. Now consider the case where i = j. Because
i ∈ Wait-Set(i) and j ∈ Release-Set(j), i = j ∈ Wait-Set(i) and i = j ∈ Release-Set(j). Then in all
cases, Wait-Set(i) ∩ Release-Set(j) ≠ ∅.
3.2.5 Property 5
Proof Assume that h, i, j ∈ 0..k − 1 with h ≤ i ≤ j. To prove: min(Wait-Set(h) ∩ Release-Set(j)) ≤
min(Wait-Set(i) ∩ Release-Set(j)).
As with property 4, we will consider two cases. Either h = i or h < i. If h = i, then Wait-Set(h) =
Wait-Set(i) (since Wait-Set is a function of its argument), and the proof of the case is
complete. If h < i, then we use Lemma 1. Consider s_h = j_1..j_y0^{k−y−1} and s_i = j_1..j_z0^{k−z−1} as
defined in Lemma 1 for some y and z.
Assume for contradiction that s_h > s_i. Then y > z (since the yth and zth bits are followed
by 0*). Noting from the lemma that h_y = i_z = 0, j_1..j_z ∈ prefix(h) and j_1..j_{z−1}0 ∈ prefix(i).
Since j_z = 1, this implies that h > i, which contradicts our starting assumption. So by contradiction,
s_h ≤ s_i. Because s_h, s_i are the unique shared elements between Wait-Set(h), Wait-Set(i) and
Release-Set(j), they are also the minimal elements in the intersection, and this case is proven.
Then in all cases, min(Wait-Set(h) ∩ Release-Set(j)) ≤ min(Wait-Set(i) ∩ Release-Set(j)).
3.2.6 Property 6
Proof ∀i : |Wait-Set(i)| = # of distinct prefixes of i + 1, |Release-Set(i)| = # of distinct prefixes of i.
Because 0 ≤ i ≤ k − 1, # of distinct prefixes of i = O(log k). So |Wait-Set(i)| and |Release-Set(i)|
are both O(log k).
3.3 Definitions
En(t) = true iff ∃w ∈ Wait-Set(j) : A[i][w] ≥ e, where [e, i, j] ← parse(t)
Update-Set(t′) = {u = t′ − j′ + r | r ∈ Release-Set(j′) ∪ {−1}}, where j′ = t′ mod k
3.4 Observations
In what follows, p and q denote processes, i and j denote indices in A, and e denotes values in
A. xp denotes the value of variable x for process p. PC (program counter) indicates the line of
the algorithm that the process is about to evaluate. 〈a, b〉 indicates the set of all processes with
PC ∈ a..b. Thus 〈1, 1〉 = set of processes in the remainder section, 〈2, 2〉 = set of processes in the
try section, 〈3, 3〉 = set of processes in the critical section, and 〈4, 6〉 = set of processes in the exit
section.
Lines are numbered in the algorithm above. The update subroutine executed in 5 is renum-
bered 5.7-5.9, while the update subroutine executed in 6 is renumbered 6.7-6.9. However, lines
7-9 are still referred to directly; PCp = 9 ≡ (PCp = 5.9 ∨ PCp = 6.9).
(O1) Entry strictly increases by increments of 1 (on 1).
(O2) Exit strictly increases by increments of 1 (on 3).
(O3) ∀i, j : A[i][j] strictly increases on a successful CAS on 9.
(O4) |〈3, 3〉| ≤ |{t | En(t)}| − Exit
(O5) max({t′ ≥ −1 | ∃p : t′ = t′p}) = Exit + k − 1
(O6) ∀p ≠ q : tp ≠ tq
(O7) ∀t ≥ 0 : En(t) never goes from true to false.
3.5 Lemmas
The following lemmas will assist the proofs of our invariants.
Lemma 2 ∀[e, i, j]← parse(t) : t = ((e− 1)×N) + (i× k) + j
Proof
((e− 1)×N) + (i× k) + j
= ((⌊t/N⌋ + 1) − 1) × N + ⌊(t mod N)/k⌋ × k + t mod k by the definition of parse
= ⌊t/N⌋ × N + t mod N by the properties of mod
= t by the properties of mod
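Lemma 2 says that parse loses no information about the token. A short Python check makes this tangible; the helper names `parse` and `unparse` and the sample values N = 15, k = 5 (the values from Figure 1) are choices made for this sketch, and N is assumed to be a multiple of k as in the definition of N.

```python
def parse(t, N, k):
    """Split token t into (epoch e, block index i, within-block index j)."""
    e, f = t // N + 1, t % N
    return e, f // k, t % k


def unparse(e, i, j, N, k):
    """Rebuild the token from its parts, following Lemma 2."""
    return (e - 1) * N + i * k + j


N, k = 15, 5
print(parse(38, N, k))                       # (3, 1, 3)
print(unparse(*parse(38, N, k), N, k))       # 38
print(all(unparse(*parse(t, N, k), N, k) == t for t in range(200)))   # True
```

Token 38 falls in the third pass over A (e = 3), in block 1, at within-block index 3, since 38 = 2 · 15 + 1 · 5 + 3.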
Lemma 3 ∀u ∈ Update-Set(t′) : u ≤ t′
Proof ∀r ∈ Release-Set(j′) : r ≤ j′, by (P3). For r = −1, r < 0 ≤ j′ trivially.
∴ −j′ + r ≤ 0
∴ u = t′ − j′ + r ≤ t′
Lemma 4 ∀p : p is enabled iff En(tp).
Proof By the definition of enabled, process p is enabled if and only if it will enter the CS in a
finite number of its own steps, regardless of the steps taken by other processes. The only point in
the Entry section that a process waits is 2, where it waits until ∃w ∈ Wait-Set(jp) : A[i][w] ≥ e.
At 2, En(tp) is true iff ∃w ∈Wait-Set(jp) : A[ip][w] ≥ ep. Since A strictly increases (by (O3)), any
process for which En(tp) is true will enter the CS immediately after reading A, regardless of the
steps taken by other processes. So the lemma holds.
3.6 Invariants
To preserve logical consistency, invariants with lower numbers are strictly used to prove invariants
with higher numbers, and never the reverse.
(I1) Entry = |〈2, 3〉|+ Exit
(I2) ∀t ≥ 0 : En(t)⇒ t < Exit + k
(I3) |〈3, 3〉| ≤ k
(I4) ∀t ≥ 0 : t < Exit⇒ En(t)
(I5) (PCp = 6.8 ∧ Exit > up + 2k) ⇒ A[ip][k − 1] ≥ ep
(I6) p ∈ 〈6.7, 6.9〉 : ∀q : (PCq = 9 ∧ ip = iq ∧ jp = jq ∧ aq = A[ip][jp])⇒ eq ≥ ep
(I7) p ∈ 〈6.8, 6.9〉 : A[ip][jp] = ap ∨A[ip][jp] ≥ ep
(I8) (p ∈ 〈4, 1〉 ∧ t′ ≠ −1) ⇒ ∀[e, i, j] = parse(u) for u ∈ Update-Set(t′p) ∧ u < up : A[i][j] ≥
e ∨ A[i][k − 1] ≥ e
(I9) (PCp = 1 ∧ t′p ≠ −1) ⇒ ∀[e, i, j] = parse(t′) where t ≥ 0 : A[i][j] ≥ e ∨ A[i][k − 1] ≥ e
(I10) (PCp = 1 ∧ t′p ≠ −1) ⇒ ∀t′p − k ≤ t ≤ t′p : En(t)
(I11) ∀0 ≤ t < t′ : En(t′)⇒ En(t)
3.6.1 Direct Proof
Some of the above invariants can be proven as direct logical consequences of preceding invariants.
(I3) Proof
|〈3, 3〉| ≤ |{t | En(t)}| − Exit by (O4)
|{t | En(t)}| ≤ Exit + k by (I2)
∴ |〈3, 3〉| ≤ Exit + k − Exit
= k
(I10) Proof Assume (PCp = 1 ∧ t′p ≠ −1). To prove: ∀t′p − k ≤ t ≤ t′p : En(t).
Since when PCp = 1 ∧ t′p ≠ −1, up = t′p, we can combine (I8) and (I9) to yield ∀[e, i, j] =
parse(u) for u ∈ Update-Set(t′p) : A[i][j] ≥ e ∨ A[i][k − 1] ≥ e. Consider two sequences:
Sprev = [t′p − j′p − k..t′p − j′p − 1] and Scur = [t′p − j′p..t′p]. By the definition of j′p, for [e, i, j] =
parse(t′p − j′p), j = 0. Then by Lemma 2, Sprev renders A[i′p − 1][0..k − 1] and Scur renders
A[i′p][0..j′p].
We will first prove ∀sprev ∈ Sprev : En(sprev). Set u = t′p − j′p − 1 ∈ Update-Set(t′p). Then
j = k − 1, so A[i][k − 1] = A[i′p − 1][k − 1] ≥ e. Since k − 1 belongs to every Wait-Set,
∀sprev ∈ Sprev : En(sprev).
Now we will prove ∀scur ∈ Scur : En(scur). By (P4), every scur will intersect with an element
in Update-Set(t′p). For each such intersecting element u, A[i][j] ≥ e ∨ A[i][k − 1] ≥ e where
[e, i, j] = parse(u). If A[i][k − 1] ≥ e, then En(scur) is true for all scur, since k − 1 ∈
Wait-Set(j) by definition. If A[i][k − 1] < e, then A[i][j] ≥ e for each u, and En(scur) is
still true for all scur. Then in all cases, ∀scur ∈ Scur : En(scur).
Thus ∀t ∈ t′p − j′p − k..t′p : En(t). Since j′p ≥ 0, we have proven the consequent.
3.6.2 Proof by Induction
An inductive proof of the correctness of our invariants follows. The proof operates for each possible
step that an arbitrary process p can take. Inductively, we assume that all of the invariants hold
for p prior to its step (when PCp = a) and prove that all of the invariants still hold after the step
(when PCp = b). We will refer to the configuration when PCp = a as C, and the configuration
when PCp = b as C.s.
Initial Configuration First, we demonstrate that all invariants hold at initialization.
(I1) Entry = Exit = 0, and since no process has executed a step, |〈2, 3〉| = 0. So the
equality holds.
(I2) Exit and A are unchanged.
(I4) Exit and A are unchanged.
(I5)-(I7) Trivially true.
(I8)-(I9) Trivially true, since t′ = −1 for all t′ at initialization.
(I11) En(t) is true for t ∈ 0..k − 1 and false for all other t, so the invariant holds.
1 → 2
(I1) Entry and |〈2, 3〉| both increase by 1, so the equality holds.
(I2),(I4) Exit and A are unchanged.
(I5)-(I9) Trivially true.
(I11) A is unchanged.
2 → 3
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I9) Trivially true.
(I11) A is unchanged.
3 → 4
(I1) Exit increases by 1, but |〈2, 3〉| decreases by 1, so |〈2, 3〉| + Exit is unchanged, and
the equality holds.
(I2) Exit increases by 1; if ∀t ≥ 0 : En(t) ⇒ t < Exit + k before this step, then trivially
t < Exit + k + 1 now.
(I4) By the inductive hypothesis, ∀t ≥ 0 : t ≤ Exit− 2⇒ En(t). The border case is t such
that t = Exit − 1; however, the process with such t is p. En(tp) must have been true
when p stepped 2 → 3, and by (O7), En(tp) will never go from true to false. ∴ En(tp)
is true and the invariant holds.
(I5)-(I7),(I9) Trivially true.
(I8) At C.s, up = t′p − j′p − 1 = min(Update-Set(t′p)), and the invariant trivially holds.
(I11) A is unchanged.
4 → 5.7
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7),(I9) Trivially true.
(I8) The first ascending u ∈ Update-Set(t′p) is t′p − j′p − 1 = min(Update-Set(t′p)), so the invariant
trivially holds.
(I11) A is unchanged.
5.7 → 5.8
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7),(I9) Trivially true.
(I8) t′p, up and A are unchanged.
(I11) A is unchanged.
5.8 → 5.9
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7),(I9) Trivially true.
(I8) t′p, up and A are unchanged.
(I11) A is unchanged.
5.8 → 6.7
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7),(I9) Trivially true.
(I8) t′p, up and A are unchanged.
(I11) A is unchanged.
5.9 → 6.7
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2) In the update subroutine, up ∈ Update-Set(t′p) from line 4. By Lemma 3,
up ≤ t′p for all such u. ∀t ≥ 0 : if ¬En(t) in C and En(t) in C.s, then ep ≥ e, ip = i
and jp ∈ Wait-Set(j) for [e, i, j] = parse(t). Because ∀w ∈ Wait-Set(j) : w ≥ j by (P2),
applying Lemma 2 shows that up ≥ t. Then t ≤ up ≤ t′p < Exit + k (by (O5)), and the
invariant holds.
(I4) By (O3), this step results only in En(t) going from false to true; thus, for any t ≥ 0
where the invariant held before, it still holds.
(I5) Trivially true.
(I6) 5.9 → 6.7 results in the CAS returning either true (upon success) or false (upon
failure). Given arbitrary q such that PCq = 9∧ iq = ip ∧ jq = jp ∧ aq = A[iq][jq], we will
address both cases:
• If the CAS succeeds, then A[ip][jp] ≥ ep. Then since A[ip][jp] is strictly increasing,
any process poised to CAS successfully must have eq > A[ip][jp] ≥ ep.
• If CAS fails, then A[ip][jp] ≠ ap. Because A[ip][jp] strictly increases, aq = A[ip][jp] >
ap. Assume for contradiction that eq < ep. aq > ap implies step 7 → 8 for q
occurred after 5.7 → 5.8 for p. By Lemma 3 and (O5), up ≤ t′p < Exit when
PCp = 5.7, so since Exit strictly increases, up < Exit when q ∈ 〈8, 9〉. Because
ip = iq ∧ jp = jq ∧ ep > eq, by Lemma 2, uq ≤ up − N. Then uq ≤ up − N < Exit − N
when PCq = 8. 8 → 9 cannot have happened, since Exit > uq + 2k when PCq = 8.
Then PCq ≠ 9, which is contradictory. So eq ≥ ep by contradiction.
Thus eq ≥ ep in all cases, and the invariant holds.
(I7) Trivially true.
(I8) t′p, up are unchanged. A[ip][jp] may have changed, but strictly increases; so since the
invariant held before, it must also hold now.
(I9) Trivially true.
(I11) Thanks to the inductive hypothesis, we know that the invariant holds for all t′ such
that En(t′) is true in C. We define tmax = max({t|En(t)}) in C. Thus we only need
to prove that ∀t > tmax : En(t), since by the inductive hypothesis all smaller t have
En(t) true in C, and En(t) never goes from true to false. By (I4), Exit− 1 ≤ tmax in C;
because Exit does not change in this step, this is true of C.s as well. t′ < Exit + 2k by
(I2), meaning t′ ≤ tmax + 2k in C.s.
By (P2), t′ is at most up in C.s. (We will subscript all of the local variables of p in C with
old, e.g. t′ ≤ uold). Thus we need to demonstrate that given any u ∈ Update-Set(t′p),
∀uold − k ≤ tmax < t ≤ uold : En(t).
If uold = t′p − j′p − 1, then the CAS is to A[i′p − 1][k − 1], and En(t) is true for the
preceding k − 1 processes, which all have k − 1 in their Wait-Set. If uold ≥ t′p − j′p, we
apply (I8). Wait-Set(t) and Release-Set(j′p) intersect for all t < j′p by (P4). Moreover,
the point of intersection for lower t ≤ point of intersection for higher t, by (P5). These
properties along with (I8) imply that either A[iold][k−1] ≥ eold, in which case En(t) for
all t in our range, or for every t there is j ∈ 0..k− 1 such that A[iold][j] ≥ eold and j is in
the Wait-Set associated with t. So in all cases, En(t) for arbitrary uold − k ≤ t ≤ uold.
6.7 → 6.8
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5) Consider ten = t′p − j′p + k − 1.
En(ten) iff A[ip][k − 1] ≥ ep, by Lemma 2 and since {k − 1} = Wait-Set(k − 1).
Since ∀r ∈ Release-Set : r ≤ k − 1 (by (P1)), t′p ≤ up + k. Therefore ten ≤ up + k − j′p +
k − 1 < up + 2k < Exit.
ten < Exit⇒ En(ten) by (I4), implying A[ip][k − 1] ≥ ep. So the invariant holds.
(I6) A[ip][jp], ip, jp and ep are unchanged.
(I7) After the atomic step 6.7 → 6.8, A[ip][jp] = ap, so the invariant holds.
(I8) t′p, up and A are unchanged.
(I9) Trivially true.
(I11) A is unchanged.
6.8 → 6.9
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5) Trivially true.
(I6) A[ip][jp], ip, jp and ep are unchanged.
(I7) If after this step A[ip][jp] ≠ ap, then earlier in the execution, while p ∈ 〈6.7, 6.9〉,
∃q : PCq = 9 ∧ ip = iq ∧ jp = jq ∧ aq = A[ip][jp]. By (I6), eq ≥ ep for all such q. So if
A[ip][jp] ≠ ap, A[ip][jp] ≥ ep. This is logically equivalent to the invariant, so it holds.
(I8) t′p, up and A are unchanged.
(I9) Trivially true.
(I11) A is unchanged.
6.8 → 5.7
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7),(I9) Trivially true.
(I8) By the inductive hypothesis, we already know that in C the invariant holds for u ∈
Update-Set(t′p) < up. Because up in C.s is the next ascending u after uold = up in C, we
need to demonstrate the consequent now holds for uold. Note that this step is taken iff
(Exit > up + 2k) ∨ (a ≥ e) in C. In the first case, by (I5), A[ip][k − 1] ≥ ep in C. Because
A[ip][k− 1] strictly increases, [eold, iold, jold] = parse(uold) gives us A[iold][jold] ≥ eold. In
the second case, the invariant holds directly, so it holds in all cases in C.s.
(I11) A is unchanged.
6.8 → 1
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Exit and A are unchanged.
(I5)-(I7) Trivially true.
(I8) t′p, rp and A are unchanged.
(I9) Analogous to the proof of (I8) for 6.8 → 5.7. up is the same in C and C.s, but this
does not alter the proof.
(I11) A is unchanged.
6.9 → 5.7
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Proof analogous to step 5.9 → 6.7.
(I5)-(I7),(I9) Trivially true.
(I8) Analogous to 6.8 → 5.7, we only need to demonstrate the invariant holds for [eold, iold, jold] =
parse(up) in C. By (I7), A[ip][jp] = ap ∨ A[ip][jp] ≥ ep in C. In the first case, the CAS
succeeds and A[iold][jold] = eold. In the second case, A[iold][k−1] ≥ eold. Either way, the
invariant is satisfied in C.s.
(I11) Proof analogous to step 5.9 → 6.7.
6.9 → 1
(I1) Entry, Exit and |〈2, 3〉| are unchanged.
(I2),(I4) Proof analogous to step 5.9 → 6.7.
(I5)-(I7) Trivially true.
(I8) t′p, rp and A are unchanged.
(I9) Analogous to the proof of (I8) for 6.9 → 5.7. Note that up is the same in C and C.s.
By (I7), A[ip][jp] = ap ∨ A[ip][jp] ≥ ep in C. In the first case, the CAS succeeds and
A[ip][jp] = ep. In the second case, A[ip][k − 1] ≥ ep. Either way, the invariant is satisfied
in C.s.
(I11) Proof analogous to step 5.9 → 6.7.
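The case split on (I7) used in the 6.9 steps can be illustrated with a minimal sketch. It assumes only that the step performs CAS(A[i][j], a, e) and that (I7) holds beforehand; the `cas` helper, the one-element list standing in for A[i][j], and the function `step_preserves_bound` are hypothetical names for illustration, not the thesis's pseudocode.

```python
def cas(cell, expected, new):
    """Compare-and-swap on a one-element list standing in for A[i][j].

    Writes `new` and returns True only if the cell still holds `expected`.
    """
    if cell[0] == expected:
        cell[0] = new
        return True
    return False


def step_preserves_bound(current, a, e):
    """Apply CAS(A[i][j], a, e) to a cell currently holding `current`.

    Precondition (I7): current == a or current >= e.
    Returns the value of the cell after the step.
    """
    assert current == a or current >= e, "(I7) must hold before the step"
    cell = [current]
    cas(cell, a, e)
    return cell[0]


# Case 1: the cell still holds a, so the CAS succeeds and writes e.
after_success = step_preserves_bound(current=3, a=3, e=7)

# Case 2: another process already advanced the cell, so the CAS fails
# and the larger value is left in place.
after_failure = step_preserves_bound(current=9, a=3, e=7)
```

In both cases the cell ends at a value of at least e, which is the shape of the argument used above.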
3.7 Proof of Desired Properties
3.7.1 k-Exclusion
Proof (I3) directly implies k-Exclusion.
3.7.2 Bounded Exit
Proof The update subroutine requires O(1) RMRs (and O(1) computation in general). Line 3
thus requires O(1) steps. Lines 4-6 require O(1) · O(|Release-Set(q′)| + 1) steps. The whole Exit
section therefore requires O(|Release-Set(q′)|) steps. |Release-Set| is O(k) by axiom 1, so Bounded
Exit is satisfied.
3.7.3 First-In First-Enabled
(I11) directly implies FIFE.
3.7.4 Starvation Freedom
Proof Assume for contradiction that in some configuration, fewer than k − 1 processes have
crashed, but there are processes in the try section that will never be enabled. We have already
proven bounded exit; therefore, in an infinite run, there will be a point beyond which the only
processes in the exit section are crashed and the only processes in the entry section are crashed
or not enabled. Select an arbitrary configuration C beyond this point. We designate the number
of processes crashed in the entry section crashn and the number of processes crashed in the exit
crashx.
Consider the highest t′ such that no process p with t′p is in the exit section. Designate this
value t′x ≥ Exit + k − 1 − crashx by (O5). By (I10) and (I11), we know that all processes with
t < En(tx) are enabled. Further consider the uncrashed process with lowest t in the try section.
Designate this tn ≤ Exit + crashn.
tn − t′x ≤ Exit + crashn − (Exit + k − 1 − crashx)
        = (crashn + crashx) − (k − 1)
        ≤ 0, since total crashes ≤ k − 1
∴ tn ≤ t′x
Thus the process with tn is enabled, uncrashed and in the try section. This is contradictory, since
the only processes in the try section are crashed or not enabled. Thus by contradiction, starvation
with fewer than k − 1 crashes is impossible.
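The counting step in this argument can be sanity-checked by brute force over small parameters. The sketch below is only an illustration: it takes the two bounds t′x ≥ Exit + k − 1 − crashx and tn ≤ Exit + crashn at their extreme values and confirms that tn ≤ t′x whenever crashn + crashx ≤ k − 1. The function names and the parameter ranges are hypothetical.

```python
def worst_case_gap(exit_value, k, crash_n, crash_x):
    """Largest possible value of t_n - t'_x under the bounds used in the proof."""
    t_n_max = exit_value + crash_n            # t_n  <= Exit + crash_n
    t_x_min = exit_value + k - 1 - crash_x    # t'_x >= Exit + k - 1 - crash_x
    return t_n_max - t_x_min


def gap_never_positive(max_k=8, max_exit=5):
    """Check the gap is <= 0 for every crash split with at most k - 1 crashes."""
    for k in range(1, max_k + 1):
        for exit_value in range(max_exit + 1):
            for crash_n in range(k):
                # crash_x ranges so that crash_n + crash_x <= k - 1
                for crash_x in range(k - crash_n):
                    if worst_case_gap(exit_value, k, crash_n, crash_x) > 0:
                        return False
    return True
```

The value of Exit cancels out, so the gap depends only on the total number of crashes relative to k − 1; with k crashes the gap becomes positive and the argument no longer applies.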
3.7.5 O(log k) RMR Complexity
With Smart Cache
Proof If we do not count CAS failures as incurring RMRs, time complexity is straightforward. The
parse subroutine is local and therefore makes no RMRs; the update subroutine costs O(1) RMRs.
Likewise, the time complexity of lines 1 and 3 is O(1). Line 2 requires O(log k) RMRs to initialize
over O(log k) objects (by (P6)), and O(1) subsequently, because we assume that CAS failure does not
incur RMRs. The loop in lines 4-6 runs O(log k) times by (P6), and requires O(1) RMRs per iteration
from above. Thus total RMR cost is O(1) + O(log k) + O(log k) = O(log k).
Without Smart Cache
Proof We will prove that including the cost of failed CAS writes on readers results in O(1)
additional amortized RMR if we increase N . Specifically, we set N = kn/logk.
First, note that not all CAS failures cause additional RMRs. An RMR cost is incurred only when
some process p is waiting for A[i][j] ≥ e, and CAS(A[i][j], a, e′) occurs for a and e′ < e.
Because N = kn/log k, it is impossible to advance a full round without Exit incrementing at least
(k − 1)N/log k times. Thus an exiting process may execute at most one failed CAS that causes
RMRs (after which it will fail to satisfy the if statement at 8).
Each such failed CAS is waited upon by ≤ k trying processes. If some p is waiting on A[i][j]
for e, then all other processes waiting on A[i][j] are either enabled or waiting for e′ ≥ e. This is
also due to the advancement required for a process to spin on A[i][j] with e > e′. The required
(k−1)N/logk increments to Exit along with (I4) imply that any process with smaller e is enabled.
Thus, since each process can cause at most 1 failed CAS affecting at most k trying processes,
and since < n processes may remain in the exit section if progress continues, each round through A
results in < nk RMRs from bad CAS. Since each round involves nk/logk processes, the amortized
additional cost is O(log k).
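The amortization at the end of this argument is a single division, sketched below. The sketch assumes base-2 logarithms (the text does not fix a base) and k ≥ 2; the function name is hypothetical, and the per-round figures are the upper bounds quoted in the proof rather than measured values.

```python
import math


def amortized_failed_cas_rmrs(n, k):
    """Upper bound on extra RMRs per process caused by failed CAS operations.

    Assumes log base 2 and requires k >= 2 so that log k is positive.
    """
    log_k = math.log2(k)
    rmrs_per_round = n * k                # fewer than nk RMRs per round through A
    processes_per_round = n * k / log_k   # nk / log k processes per round
    return rmrs_per_round / processes_per_round
```

For example, with n = 64 and k = 8 the bound works out to log k = 3 additional RMRs per process.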
3.7.6 Space Complexity
Proof The space complexity of shared variables in our algorithm is the size of A + 2. With Smart
Cache, the size of A ≤ n + k − 1, since N is the smallest integer ≥ n such that N mod k = 0 and
n > k. So the worst-case total space complexity is n + k + 1 = O(n). Without Smart Cache, as
discussed above, the worst-case total space complexity will be (nk/log k) + k + 1 = O(nk/log k).
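The two space bounds can be checked numerically with a short sketch. It assumes that A holds N cells, as the bound in the text suggests, and that logarithms are base 2; the function names are hypothetical.

```python
import math


def array_size_smart_cache(n, k):
    """Smallest integer N >= n with N mod k == 0."""
    return -(-n // k) * k   # ceiling division by k, then scale back up


def shared_space_smart_cache(n, k):
    """Size of A plus the 2 additional shared variables counted in the text."""
    return array_size_smart_cache(n, k) + 2


def shared_space_bound_without_smart_cache(n, k):
    """Worst-case bound (nk / log k) + k + 1, assuming log base 2 and k >= 2."""
    return n * k / math.log2(k) + k + 1
```

For n = 10 and k = 3, N = 12 = n + k − 1, so the Smart Cache bound of n + k + 1 = 14 shared words is met with equality.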
3.8 Model Checking
In addition to the invariant proof, the algorithm in this paper has been checked using Leslie
Lamport’s Temporal Logic of Actions+ (TLA+) language [6], and the associated TLA+ Model Checker
(TLC) [7]. The algorithm has been tested for deadlock, k-Exclusion and the FIFE property on
systems consisting of ≤ 6 processes with k ≤ 3. More than 58 million distinct states have been
checked without violation of any of these properties.
It should be noted that TLC does not check every possible configuration even for a bounded
number of processes, and that TLC checking therefore does not prove correctness. Moreover,
starvation freedom cannot be tested using TLC. However, model checking provides reassurance of
correctness without requiring the reader to address the complexities of invariant-based proof.
4 Further Research
The variables used in our algorithm are unbounded; a straightforward improvement would be to
bound the variables. Improvements in the time complexity of k-Exclusion may be possible as
well. In particular, modifying Wait-Set and Release-Set might result in o(log k) worst-case RMR
complexity, although no such modifications are evident. Intuition suggests that Ω(log k) is the
lower bound for k-Exclusion. ω(1) or Ω(log k) RMR lower bound proofs would help confirm this.
5 Acknowledgements
This thesis was made possible by the thoughtful collaboration of Jack Bowman, Michael Diamond,
Matthew Elkherj, Zhiyu Liu and Nancy Zheng. Thanks to Lilai Guo for her support and assistance
in proofreading. Most of all, thanks to Prasad Jayanti, whose inspiring teaching, excellent feedback
and untrammelled enthusiasm for knowledge brought this research to life.
References
[1] Michael J. Fischer, Nancy A. Lynch, James E. Burns, and Allan Borodin. Resource allocation
with immunity to limited process failure (preliminary report). In FOCS’79, pages 234–254,
1979.
[2] James H. Anderson and Mark Moir. Using local-spin k-exclusion algorithms to improve wait-free
object implementations. Distributed Computing, 11:141–150, 1997.
[3] Robert Danek. The k-bakery: local-spin k-exclusion using non-atomic reads and writes. In
Proceeding of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing,
PODC ’10, pages 36–44, New York, NY, USA, 2010. ACM.
[4] Chase Decker and Prasad Jayanti. A Solution to k-Assignment in O(k) RMR Complexity.
Dartmouth College, 2010.
[5] Robert Danek and Vassos Hadzilacos. Local-spin group mutual exclusion algorithms. In
DISC’04, pages 71–85, 2004.
[6] Leslie Lamport. Introduction to TLA. Technical Report SRC-TN-1994-001, HP Labs, 1994.
[7] Leslie Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software
Engineers. Addison-Wesley Professional, 2003.
