A Complexity-Based Hierarchy for Multiprocessor Synchronization by Ellen, Faith et al.
arXiv:1607.06139v2 [cs.DC] 4 May 2018
A Complexity-Based Hierarchy for Multiprocessor Synchronization
Faith Ellen
University of Toronto
faith@cs.toronto.edu
Rati Gelashvili
MIT
gelash@mit.edu
Nir Shavit
MIT
shanir@csail.mit.edu
Leqi Zhu
University of Toronto
lezhu@cs.toronto.edu
Abstract
For many years, Herlihy’s elegant computability-based Consensus Hierarchy has been our best ex-
planation of the relative power of various types of multiprocessor synchronization objects when used
in deterministic algorithms. However, key to this hierarchy is treating synchronization instructions as
distinct objects, an approach that is far from the real-world, where multiprocessor programs apply syn-
chronization instructions to collections of arbitrary memory locations. We were surprised to realize that,
when considering instructions applied to memory locations, the computability based hierarchy collapses.
This leaves open the question of how to better capture the power of various synchronization instructions.
In this paper, we provide an approach to answering this question. We present a hierarchy of synchro-
nization instructions, classified by the space complexity necessary to solve consensus in an obstruction-
free manner using these instructions. Our hierarchy provides a classification of combinations of known
instructions that seems to fit with our intuition of how useful some are in practice, while questioning
the effectiveness of others. In particular, we prove an essentially tight characterization of the power of
buffered read and write instructions. Interestingly, we show a similar result for multi-location atomic
assignments.
1 Introduction
Herlihy’s Consensus Hierarchy [Her91] assigns a consensus number to each object, namely, the number of
processes for which there is a wait-free binary consensus algorithm using only instances of this object and
read-write registers. It is simple, elegant and, for many years, has been our best explanation of synchroniza-
tion power.
Robustness says that, using combinations of objects with consensus numbers at most k, it is not possible
to solve wait-free consensus for more than k processes [Jay93]. The implication is that modern machines need
to provide objects with infinite consensus number. Otherwise, they will not be universal, that is, they cannot
be used to implement all objects or solve all tasks in a wait-free (or non-blocking) manner for any number
of processes [Her91, Tau06, Ray12, HS12]. Although there are ingenious non-deterministic constructions
that prove that Herlihy’s Consensus Hierarchy is not robust [Sch97, LH00], it is known to be robust for
deterministic one-shot objects [HR00] and deterministic read-modify-write and readable objects [Rup00]. It
is unknown whether it is robust for general deterministic objects.
In adopting this explanation of computational power, we failed to notice an important fact: multipro-
cessors do not compute using synchronization objects. Rather, they apply synchronization instructions
to locations in memory. With this point of view, Herlihy’s Consensus Hierarchy no longer captures the
phenomena we are trying to explain.
For example, consider two simple instructions:
• fetch-and-add(2), which returns the number stored in a memory location and increases its value by 2,
and
• test-and-set(), which returns the number stored in a memory location and sets it to 1 if it contained 0.
(This definition of test-and-set is slightly stronger than the standard definition, which always sets the location
to which it is applied to 1. Both definitions behave identically when the values in the location are in {0, 1}.)
Objects that support only one of these instructions have consensus number 2. Moreover, these deterministic
read-modify-write objects cannot be combined to solve wait-free consensus for 3 or more processes. However,
with an object that supports both instructions, it is possible to solve wait-free binary consensus for any
number of processes. The protocol uses a single memory location initialized to 0. Processes with input 0
perform fetch-and-add(2), while processes with input 1 perform test-and-set(). If the value returned is odd,
the process decides 1. If the value 0 was returned from test-and-set(), the process also decides 1. Otherwise,
the process decides 0.
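The protocol above is small enough to check exhaustively. The following Python sketch is illustrative only: the names Location, decide, run and all_orders_agree are ours, and each process's single atomic instruction is modelled as one method call executed in a chosen order.

```python
from itertools import permutations, product


class Location:
    """A single memory location holding an integer, initially 0."""

    def __init__(self):
        self.value = 0

    def fetch_and_add(self, k):
        old = self.value
        self.value += k
        return old

    def test_and_set(self):
        # Variant used in the text: sets the location to 1 only if it held 0.
        old = self.value
        if old == 0:
            self.value = 1
        return old


def decide(loc, inp):
    """One process: a single atomic instruction followed by a local decision."""
    if inp == 0:
        ret = loc.fetch_and_add(2)
        return 1 if ret % 2 == 1 else 0
    ret = loc.test_and_set()
    return 1 if (ret % 2 == 1 or ret == 0) else 0


def run(inputs, order):
    """Let the processes take their single step in the given order."""
    loc = Location()
    return {p: decide(loc, inputs[p]) for p in order}


def all_orders_agree(n):
    """Check agreement and validity for every input vector and every order."""
    for inputs in product([0, 1], repeat=n):
        for order in permutations(range(n)):
            decisions = set(run(inputs, order).values())
            if len(decisions) != 1 or not decisions <= set(inputs):
                return False
    return True
```

Since every process takes exactly one shared-memory step, enumerating the orders of those steps covers all executions in which every process finishes.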
Another example considers three instructions:
• read(), which returns the number stored in a memory location,
• decrement(), which decrements the number stored in a memory location and returns nothing, and
• multiply(x), which multiplies the number stored in a memory location by x and returns nothing.
A similar situation arises: Objects that support only two of these instructions have consensus number 1 and
cannot be combined to solve wait-free consensus for 2 or more processes. However, using an object that
supports all three instructions, it is possible to solve wait-free binary consensus for any number of processes.
The protocol uses a single memory location initialized to 1. Processes with input 0 perform decrement(),
while processes with input 1 perform multiply(n). The second operation by each process is read(). If the
value returned is positive, then the process decides 1. Otherwise (it is zero or negative), the process decides 0.
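This protocol can also be checked by brute force for small n. The Python sketch below is an illustration under our own assumptions: the helper names are hypothetical, each process takes exactly two steps (its update, then its read()), and a read of 0 is treated as "not positive", so the process decides 0. The last point matters because a single decrement() leaves the location at exactly 0.

```python
from itertools import permutations, product


def run(inputs, schedule):
    """Execute one interleaving; `schedule` lists each process id exactly twice."""
    n = len(inputs)
    value = 1                        # the single location, initialised to 1
    updated = [False] * n
    decisions = {}
    for p in schedule:
        if not updated[p]:
            if inputs[p] == 0:
                value -= 1           # decrement()
            else:
                value *= n           # multiply(n)
            updated[p] = True
        else:
            # read(), then decide locally; 0 counts as "not positive"
            decisions[p] = 1 if value > 0 else 0
    return decisions


def all_schedules(n):
    """All interleavings in which every process takes its two steps."""
    return set(permutations([p for p in range(n) for _ in range(2)]))


def protocol_is_correct(n):
    """Check agreement and validity over all inputs and interleavings."""
    for inputs in product([0, 1], repeat=n):
        for schedule in all_schedules(n):
            decisions = set(run(inputs, schedule).values())
            if len(decisions) != 1 or not decisions <= set(inputs):
                return False
    return True
```

The check reflects the intuition behind the protocol: if the first update is a decrement(), the location is at most 0 forever after, and if it is a multiply(n), the at most n − 1 later decrements can never bring the value below 1.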
For randomized computation, Herlihy’s Consensus Hierarchy also collapses: randomized wait-free binary
consensus among any number of processes can be solved using only read-write registers, which have
consensus number 1. Ellen, Herlihy, and Shavit [FHS98] proved that Ω(√n) historyless objects, which support
only trivial operations, such as read, and historyless operations, such as write, test-and-set, and swap, are
necessary to solve this problem. They noted that, in contrast, only one fetch-and-increment or fetch-and-add
object suffices for solving this problem. Yet, these objects and historyless objects are similarly classified in
Herlihy’s Consensus Hierarchy (i.e. they all have consensus number 1 or 2). They suggested that the number
of instances of an object needed to solve randomized wait-free consensus among n processes might be another
way to classify the power of the object.
Motivated by these observations, we consider a classification of instruction sets based on the number
of memory locations needed to solve obstruction-free n-valued consensus among n ≥ 2 processes. Obstruc-
tion freedom is a simple and natural progress measure. Some state-of-the-art synchronization operations,
for example hardware transactions [Int12], do not guarantee more than obstruction freedom. Obstruction
freedom is also closely related to randomized computation. In fact, any (deterministic) obstruction-free
algorithm can be transformed into a randomized wait-free algorithm that uses the same number of memory
locations (against an oblivious adversary) [GHHW13]. Obstruction-free algorithms can also be transformed
into wait-free algorithms in the unknown-bound semi-synchronous model [FLMS05]. Recently, it has been
shown that any lower bound on the number of registers used by obstruction-free algorithms also applies to
randomized wait-free algorithms [EGZ18].
1.1 Our Results
Let n-consensus denote the problem of solving obstruction-free n-valued consensus among n ≥ 2 processes.
For any set of instructions I, let SP(I, n) denote the minimum number of memory locations supporting I
that are needed to solve n-consensus. This is a function from the positive integers, Z+, to Z+ ∪ {∞}. For
various instruction sets I, we provide upper and lower bounds on SP(I, n). The results are summarized
in Table 1.
We begin, in Section 3, by considering the instructions
• multiply(x), which multiplies the number stored in a memory location by x and returns nothing,
• add(x), which adds x to the number stored in a memory location and returns nothing, and
• set-bit(x), which sets bit x of a memory location to 1 and returns nothing.
We show that one memory location supporting read() and one of these instructions can be used to solve
n-consensus. The idea is to show that these instruction sets can implement n counters in a single location.
We can then use a racing counters algorithm [AH90].
Next, we consider max-registers [AAC09]. These are memory locations supporting
• read-max(), which reads the number stored in a memory location, and
• write-max(x), which stores the number x in a memory location, provided it contains a value less than
x, and returns nothing.
In Section 4, we prove that two max registers are necessary and sufficient for solving n-consensus.
In Section 5, we prove that a single memory location supporting {read(),write(x), fetch-and-increment()}
cannot be used to solve n-consensus, for n ≥ 3. We also present an algorithm for solving n-consensus using
O(log n) such memory locations.
In Section 6, we introduce a family of buffered read and buffered write instructions Bℓ, for ℓ ≥ 1, and
show how to solve n-consensus using ⌈n/ℓ⌉ memory locations supporting these instructions. Extending Zhu’s
n − 1 lower bound [Zhu16], we also prove that ⌈(n − 1)/ℓ⌉ such memory locations are necessary, which is tight
except when n − 1 is divisible by ℓ.
Our main technical contribution is in Section 7, where we show a lower bound of ⌈(n − 1)/(2ℓ)⌉ locations, even in
the presence of atomic multiple assignment. Multiple assignment can be implemented by simple transactions,
so our result implies that such transactions cannot significantly reduce space complexity. The proof further
extends the techniques of [Zhu16] via a nice combinatorial argument, which is of independent interest.
There are algorithms that solve n-consensus using n registers [AH90, BRS15, Zhu15]. This is tight by the
recent result of [EGZ18], which shows a lower bound of n registers for binary consensus among n processes
and, hence, for n-consensus. In Section 8, we present a modification of a known anonymous algorithm for
n-consensus [Zhu15], which solves n-consensus using n− 1 memory locations supporting {read(), swap(x)}.
A lower bound of Ω(√n) locations appears in [FHS98]. This lower bound also applies to locations that only
support test-and-set(), reset() and read() instructions.
Finally, in Section 9, we show that an unbounded number of memory locations supporting read() and
either write(1) or test-and-set() are necessary and sufficient to solve n-consensus, for n ≥ 3. Furthermore,
we show how to reduce the number of memory locations to O(n log n) when in addition to read(), write(0)
and write(1) are both available, or test-and-set() and reset() are both available.
Instructions I                                          SP(I, n)
{read(), test-and-set()}, {read(), write(1)}            ∞
{read(), write(1), write(0)}                            n (lower), O(n log n) (upper)
{read(), write(x)}                                      n
{read(), test-and-set(), reset()}                       Ω(√n) (lower), O(n log n) (upper)
{read(), swap(x)}                                       Ω(√n) (lower), n − 1 (upper)
{ℓ-buffer-read(), ℓ-buffer-write(x)}                    ⌈(n − 1)/ℓ⌉ (lower), ⌈n/ℓ⌉ (upper)
{read(), write(x), increment()},                        2 (lower), O(log n) (upper)
  {read(), write(x), fetch-and-increment()}
{read-max(), write-max(x)}                              2
{compare-and-swap(x, y)}, {read(), set-bit(x)},         1
  {read(), add(x)}, {read(), multiply(x)},
  {fetch-and-add(x)}, {fetch-and-multiply(x)}
Table 1: Space Hierarchy
2 Model
We consider an asynchronous system of n ≥ 2 processes, with ids 0, 1 . . . , n − 1, that supports a set of
deterministic synchronization instructions, I, on a set of identical memory locations. The processes take
steps at arbitrary, possibly changing, speeds and may crash at any time. Each step is an atomic invocation of
some instruction on some memory location by some process. Scheduling is controlled by an adversary. This
is a standard asynchronous shared memory model [AW04], with the restriction that every memory location
supports the same set of instructions. We call this restriction the uniformity requirement.
When allocated a step by the scheduler, a process performs one instruction on one shared memory location
and, based on the result, may then perform an arbitrary amount of local computation. A configuration
consists of the state of every process and the contents of every memory location.
Processes can use instructions on the memory locations to simulate (or implement) various objects. An
object provides a set of operations which processes can call to access and/or change the value of the object.
Although a memory location together with the supported instructions can be viewed as an object, we do
not do so, to emphasize the uniformity requirement.
We consider the problem of solving obstruction-free m-valued consensus in such a system. Initially, each
of the n processes has an input from {0, 1, . . . ,m− 1} and is supposed to output a value (called a decision),
such that all decisions are the same (agreement) and equal to the input of one of the processes (validity).
Once a process has decided (i.e. output its decision), the scheduler does not allocate it any further steps.
Obstruction-freedom means that, from each reachable configuration, each process will eventually decide a
value in a solo execution, i.e. if the adversarial scheduler gives it sufficiently many consecutive steps. When
m = n, we call this problem n-consensus and, when m = 2, we call this problem binary consensus. Note
that lower bounds for binary consensus also apply to n-consensus.
In every reachable configuration of a consensus algorithm, each process has either decided or has one
specific instruction it will perform on a particular memory location when next allocated a step by the
scheduler. In this latter case, we say that the process is poised to perform that instruction on that memory
location in the configuration.
3 Arithmetic Instructions
Consider a system that supports only read() and either add(x), multiply(x), or set-bit(x). We show how to
solve n-consensus using a single memory location in such a system. The idea is to show that we can simulate
certain collections of objects that can solve n-consensus.
An m-component unbounded counter object has m components, each with a nonnegative integral value. It
supports an increment() operation on each component, which increments the count stored in the component
by 1, and a scan() operation, which returns the counts of all m components. In the next lemma, we present
a racing counters algorithm that bears some similarity to a consensus algorithm by Aspnes and Herlihy
[AH90].
Lemma 3.1. It is possible to solve obstruction-free m-valued consensus among n processes using an
m-component unbounded counter.
Proof. We associate a separate component cv with each possible input value v. All components are initially
0. Each process alternates between promoting a value (incrementing the component associated with that
value) and performing a scan of all m components. A process first promotes its input value. After performing
a scan, if it observes that the count stored in component, cv, associated with some value v is at least n larger
than the counts stored in all other components, it returns the value v. Otherwise, it promotes the value
associated with a component containing the largest count (breaking ties arbitrarily).
If some process returns the value v, then each other process will increment some component at most
once before next performing a scan. In each of those scans, the count stored in cv will still be larger than
the counts stored in all other components. From then on, these processes will promote value v and keep
incrementing cv. Eventually, the count in component cv will be at least n larger than the counts in all other
components, and these processes will return v, ensuring agreement.
Obstruction-freedom follows because a process running on its own will continue to increment the same
component, which will eventually be n larger than the counts in all other components.
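The racing counters algorithm of Lemma 3.1 can be sketched as a small simulation. The Python code below is only an illustration under our own assumptions: the counter object is modelled as a list with atomic increment and scan, each process is a generator that performs one shared step per resumption, ties are broken towards the smallest index, and the scheduler (random contention followed by solo runs) is a hypothetical stand-in for the adversary.

```python
import random


def process(pid, value, counts, n, decisions):
    """Each resumption performs exactly one atomic step on the shared counter."""
    m = len(counts)
    while True:
        counts[value] += 1                       # increment(): promote `value`
        yield
        snap = tuple(counts)                     # scan(): atomic snapshot
        lead = max(range(m), key=lambda u: snap[u])
        if all(snap[lead] >= snap[u] + n for u in range(m) if u != lead):
            decisions[pid] = lead                # lead is at least n ahead
            return
        value = lead                             # promote the largest count next
        yield


def run(inputs, m, seed, contended_steps=300):
    n = len(inputs)
    rng = random.Random(seed)
    counts = [0] * m
    decisions = {}
    procs = {p: process(p, v, counts, n, decisions) for p, v in enumerate(inputs)}

    def step(p):
        try:
            next(procs[p])
        except StopIteration:
            del procs[p]                         # the process has decided

    for _ in range(contended_steps):             # arbitrary interleaving
        if not procs:
            break
        step(rng.choice(sorted(procs)))
    for p in sorted(procs):                      # then let each survivor run solo
        while p in procs:
            step(p)
    return decisions
```

The solo phase is what obstruction-freedom promises: a process running alone keeps incrementing the same component until it leads by n and then returns.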
In this protocol, the counts stored in the components may grow arbitrarily large. The next lemma shows
that it is possible to avoid this problem, provided each component also supports a decrement() operation.
More formally, an m-component bounded counter object has m components, where each component stores a
count in {0, 1, . . . , 3n − 1}. It supports both increment() and decrement() operations on each component,
along with a scan() operation, which returns the count stored in every component. If a process ever attempts
to increment a component that has count 3n − 1 or decrement a component that has count 0, the object
breaks (and every subsequent operation invocation returns ⊥).
Lemma 3.2. It is possible to solve obstruction-free m-valued consensus among n processes using an m-
component bounded counter.
Proof. We modify the construction in Lemma 3.1 slightly by changing what a process does when it wants
to increment cv to promote the value v. Among the other components (i.e. excluding cv), let cu be one that
stores the largest count. If cu < n, it increments cv, as before. If cu ≥ n, then, instead of incrementing cv,
it decrements cu.
A component with value 0 is never decremented. This is because, after the last time some process
observed that it stored a count greater than or equal to n, each process will decrement the component at
most once before performing a scan(). Similarly, a component cv never becomes larger than 3n− 1: After
the last time some process observed it to have count less than 2n, each process can increment cv at most
once before performing a scan(). If cv ≥ 2n, then either the other components are less than n, in which case
the process returns without incrementing cv, or the process decrements some other component, instead of
incrementing cv.
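The modified promotion rule of Lemma 3.2 can be exercised in the same style. The following Python sketch rests on our own assumptions: BoundedCounter, process and run are hypothetical names, a broken object is modelled simply by a flag, ties are broken towards the smallest index, and the random scheduler only samples executions rather than covering them all.

```python
import random


class BoundedCounter:
    """m components with counts in {0, ..., 3n - 1}; out-of-range updates break it."""

    def __init__(self, m, n):
        self.counts = [0] * m
        self.top = 3 * n - 1
        self.broken = False

    def increment(self, v):
        if self.counts[v] == self.top:
            self.broken = True
        else:
            self.counts[v] += 1

    def decrement(self, v):
        if self.counts[v] == 0:
            self.broken = True
        else:
            self.counts[v] -= 1

    def scan(self):
        return tuple(self.counts)


def process(pid, value, ctr, n, decisions):
    """Each resumption performs at most one atomic step on the counter."""
    m = len(ctr.counts)
    ctr.increment(value)                     # first promotion of the input value
    yield
    while True:
        snap = ctr.scan()
        lead = max(range(m), key=lambda u: snap[u])
        others = [u for u in range(m) if u != lead]
        if all(snap[lead] >= snap[u] + n for u in others):
            decisions[pid] = lead
            return
        rival = max(others, key=lambda u: snap[u])   # largest other component
        yield
        if snap[rival] < n:
            ctr.increment(lead)              # promote by incrementing, as before
        else:
            ctr.decrement(rival)             # promote by decrementing the rival
        yield


def run(inputs, m, seed, contended_steps=2000):
    n = len(inputs)
    rng = random.Random(seed)
    ctr = BoundedCounter(m, n)
    decisions = {}
    procs = {p: process(p, v, ctr, n, decisions) for p, v in enumerate(inputs)}

    def step(p):
        try:
            next(procs[p])
        except StopIteration:
            del procs[p]

    for _ in range(contended_steps):
        if not procs:
            break
        step(rng.choice(sorted(procs)))
    for p in sorted(procs):                  # finish every remaining process solo
        while p in procs:
            step(p)
    return decisions, ctr
```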
In the following theorem, we show how to simulate unbounded and bounded counter objects.
Theorem 3.3. It is possible to solve n-consensus using a single memory location that supports only read()
and either multiply(x), add(x), or set-bit(x).
Proof. We first give an obstruction-free implementation of an n-component unbounded counter object using a
single location that supports read() and multiply(x). By Lemma 3.1, this is sufficient for solving n-consensus.
The location is initialized with value 1. For each v ∈ {0, . . . , n− 1}, let pv be the (v + 1)’st prime number.
A process increments component cv by performing multiply(pv). A read() instruction returns the value x
currently stored in the memory location. This provides a scan of all components: component cv is the
exponent of pv in the prime decomposition of x.
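As an illustration of this encoding, the following Python sketch packs n counters into one integer using the first n primes. The class and function names are ours, and the memory location is modelled as an unbounded Python integer.

```python
def first_primes(k):
    """Return the first k primes by trial division (adequate for small k)."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


class MultiplyLocation:
    """A memory location supporting only read() and multiply(x)."""

    def __init__(self):
        self.value = 1

    def multiply(self, x):
        self.value *= x

    def read(self):
        return self.value


class PrimeCounter:
    """n-component unbounded counter stored in a single MultiplyLocation."""

    def __init__(self, n):
        self.primes = first_primes(n)
        self.loc = MultiplyLocation()

    def increment(self, v):
        self.loc.multiply(self.primes[v])    # one atomic multiply(p_v)

    def scan(self):
        x = self.loc.read()                  # one atomic read()
        counts = []
        for p in self.primes:                # exponent of p_v, computed locally
            e = 0
            while x % p == 0:
                x //= p
                e += 1
            counts.append(e)
        return counts
```

Because the factorization is done locally on the value returned by a single read(), the scan is atomic.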
A similar construction does not work using only read() and add(x) instructions. For example, suppose
one component is incremented by calling add(a) and another component is incremented by calling add(b).
Then, the value ab can be obtained by incrementing the first component b times or incrementing the second
component a times.
However, we can use a single memory location that supports {read(), add(x)} to implement an
n-component bounded counter. By Lemma 3.2, this is sufficient for solving consensus. We view the value
stored in the location as a number written in base 3n and interpret the i’th least significant digit of this
number as the count of component c_{i−1}. The location is initialized with the value 0. To increment c_i, a
process performs add((3n)^i); to decrement c_i, it performs add(−(3n)^i); and read() provides a scan of all n
components.
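The base-3n encoding can be sketched as follows. This is a minimal Python illustration with names of our choosing; it assumes, as Lemma 3.2 guarantees for the consensus algorithm, that every count stays in {0, . . . , 3n − 1}, so no digit ever carries or borrows.

```python
class AddLocation:
    """A memory location supporting only read() and add(x)."""

    def __init__(self):
        self.value = 0

    def add(self, x):
        self.value += x

    def read(self):
        return self.value


class DigitCounter:
    """n-component bounded counter: component i is digit i of a base-3n number."""

    def __init__(self, n):
        self.n = n
        self.base = 3 * n
        self.loc = AddLocation()

    def increment(self, i):
        self.loc.add(self.base ** i)

    def decrement(self, i):
        self.loc.add(-(self.base ** i))

    def scan(self):
        x = self.loc.read()                  # one atomic read()
        digits = []
        for _ in range(self.n):              # peel off digits locally
            x, d = divmod(x, self.base)
            digits.append(d)
        return digits
```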
Finally, in systems supporting read() and set-bit(x), we can implement an n-component unbounded
counter by viewing the memory location as being partitioned into blocks, each consisting of n^2 bits. Initially
all bits are 0. Each process locally stores the number of times it has incremented each component cv. To
increment component cv, process i sets the (vn + i)’th bit in block b + 1 to 1, where b is the number of times
it has previously incremented component cv. It is possible to determine the count currently stored in each component
via a single read(): The count stored in component cv is simply the sum of the number of times each process
has incremented cv.
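A possible rendering of the set-bit construction is sketched below in Python. The names are ours, and the bit numbering is an assumption: bits within a block are indexed from 0, and block b + 1 occupies global bit positions b·n^2 through (b + 1)·n^2 − 1.

```python
class SetBitLocation:
    """A memory location supporting only read() and set-bit(x)."""

    def __init__(self):
        self.value = 0

    def set_bit(self, x):
        self.value |= 1 << x

    def read(self):
        return self.value


class SetBitCounter:
    """n-component unbounded counter for n processes in one SetBitLocation."""

    def __init__(self, n):
        self.n = n
        self.loc = SetBitLocation()
        # local[i][v]: how often process i has incremented c_v (process-local state)
        self.local = [[0] * n for _ in range(n)]

    def increment(self, i, v):
        b = self.local[i][v]
        self.loc.set_bit(b * self.n * self.n + v * self.n + i)
        self.local[i][v] = b + 1

    def scan(self):
        x = self.loc.read()                  # one atomic read()
        counts = [0] * self.n
        pos = 0
        while x:
            if x & 1:
                # position within its block, divided by n, names the component
                counts[(pos % (self.n * self.n)) // self.n] += 1
            x >>= 1
            pos += 1
        return counts
```

Each set bit records exactly one increment, so the count of component cv is the number of set bits whose offset within a block lies in the range reserved for v.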
4 Max-Registers
A max-register object [AAC09] supports two operations, write-max(x) and read-max(). The write-max(x)
operation sets the value of the max-register to x if x is larger than the current value and read-max() returns
the current value of the max-register (which is the largest amongst all values previously written to it). We
show that two max-registers are necessary and sufficient for solving n-consensus.
Theorem 4.1. It is not possible to solve obstruction-free binary consensus for n ≥ 2 processes using a single
max-register.
Proof. Consider a solo terminating execution α of process p with input 0 and a solo terminating execution β
of process q with input 1. We show how to interleave these two executions so that the resulting execution is
indistinguishable to both processes from their respective solo executions. Hence, both values will be returned,
contradicting agreement.
To build the interleaved execution, run both processes until they are first poised to perform write-max.
Suppose p is poised to perform write-max(a) and q is poised to perform write-max(b). If a ≤ b, let p take
steps until it is next poised to perform write-max or until the end of α, if it performs no more write-max
operations. Otherwise, let q take steps until it is next poised to perform write-max or until the end of
β. Repeat this until one of the processes reaches the end of its execution and then let the other process
finish.
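The interleaving rule in this proof can be replayed mechanically. In the Python sketch below, which uses hypothetical names and a simplified model, each solo execution is abstracted to the sequence of values it passes to write-max, and the internal assertion checks that every read a process performs between its writes would return the same value as in its solo execution.

```python
def interleave(writes_p, writes_q, initial=0):
    """Schedule two solo write-max sequences so that neither process can tell."""
    reg = initial                            # the single max-register
    solo = {"p": initial, "q": initial}      # register value in each solo run
    pending = {"p": list(writes_p), "q": list(writes_q)}
    schedule = []

    def step(who):
        nonlocal reg
        x = pending[who].pop(0)
        reg = max(reg, x)                    # write-max(x)
        solo[who] = max(solo[who], x)
        # Reads by `who` before its next write-max return reg; they must
        # match what `who` would have read when running alone.
        assert reg == solo[who], "runs would be distinguishable"
        schedule.append((who, x))

    while pending["p"] and pending["q"]:
        a, b = pending["p"][0], pending["q"][0]
        step("p" if a <= b else "q")         # the smaller pending write goes first
    for who in ("p", "q"):                   # one process is done; finish the other
        while pending[who]:
            step(who)
    return schedule
```

For example, interleave([3, 1, 7], [2, 5, 6]) lets q write 2, then p write 3 and 1, then q write 5 and 6, and finally p write 7; after each write the register holds exactly the maximum that the writer has itself written so far.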
Theorem 4.2. It is possible to solve n-consensus for any number of processes using only two max-registers.
Proof. We describe a protocol for n-consensus using two max-registers, m1 and m2. Consider the lexico-
graphic ordering ≺ on the set S = N × {0, . . . , n − 1} = {(r, x) : r ≥ 0 and x ∈ {0, . . . , n − 1}}. Let y
be a fixed prime that is larger than n. Note that, for (r, x), (r′, x′) ∈ S, (r, x) ≺ (r′, x′) if and only if
(x + 1)y^r < (x′ + 1)y^{r′}. Thus, by identifying (r, x) ∈ S with (x + 1)y^r, we may assume that m1 and m2 are
max-registers defined on S with respect to the lexicographic ordering ≺.
Since no operations decrease the value in a max-register, it is possible to implement an obstruction-free
scan operation on m1 and m2 using the double collect algorithm in [AAD+93]: A process repeatedly collects
the values in both locations (performing read-max() on each location to obtain its value) until it observes
two consecutive collects with the same values.
Initially, both m1 and m2 have value (0, 0). Each process alternately performs write-max on one
component and takes a scan of both components. It begins by performing write-max(0, x′) to m1, where
x′ ∈ {0, . . . , n − 1} is its input value. If m1 has value (r + 1, x) and m2 has value (r, x) in the scan, then it
decides x and terminates. If both m1 and m2 have value (r, x) in the scan, then it performs write-max(r + 1, x)
to m1. Otherwise, it performs write-max to m2 with the value of m1 in the scan.
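The protocol just described can be sketched as a simulation. The Python code below is illustrative and makes several assumptions of our own: pairs (r, x) are Python tuples, whose built-in comparison is lexicographic, each read-max or write-max is one scheduled step of a generator, and the scheduler (random contention, then solo runs) is a hypothetical stand-in for the adversary.

```python
import random


class MaxRegister:
    def __init__(self, initial):
        self.value = initial

    def write_max(self, x):
        if x > self.value:
            self.value = x

    def read_max(self):
        return self.value


def process(pid, inp, m1, m2, decisions):
    """Each resumption performs at most one read-max or write-max."""
    m1.write_max((0, inp))
    yield
    while True:
        prev = None
        while True:                          # double collect until stable
            a = m1.read_max()
            yield
            b = m2.read_max()
            yield
            if (a, b) == prev:
                break
            prev = (a, b)
        (r1, x1), (r2, x2) = a, b
        if x1 == x2 and r1 == r2 + 1:
            decisions[pid] = x1              # m1 = (r + 1, x) and m2 = (r, x)
            return
        if a == b:
            m1.write_max((r1 + 1, x1))       # both equal (r, x): advance m1
        else:
            m2.write_max(a)                  # otherwise copy m1's value to m2
        yield


def run(inputs, seed, contended_steps=400):
    rng = random.Random(seed)
    m1, m2 = MaxRegister((0, 0)), MaxRegister((0, 0))
    decisions = {}
    procs = {p: process(p, x, m1, m2, decisions) for p, x in enumerate(inputs)}

    def step(p):
        try:
            next(procs[p])
        except StopIteration:
            del procs[p]

    for _ in range(contended_steps):
        if not procs:
            break
        step(rng.choice(sorted(procs)))
    for p in sorted(procs):                  # finish every remaining process solo
        while p in procs:
            step(p)
    return decisions
```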
To obtain a contradiction, suppose that there is an execution in which some process p decides value x
and another process q decides value x′ ≠ x. Immediately before its decision, p performed a scan where m1
had value (r + 1, x) and m2 had value (r, x), for some r ≥ 0. Similarly, immediately before its decision, q
performed a scan where m1 had value (r′ + 1, x′) and m2 had value (r′, x′), for some r′ ≥ 0. Without loss
of generality, we may assume that q’s scan occurs after p’s scan. In particular, m2 had value (r, x) before it
had value (r′, x′). So, from the specification of a max-register, (r, x) ⪯ (r′, x′). Since x′ ≠ x, it follows that
(r, x) ≺ (r′, x′).
We show inductively, for j = r′, . . . , 0, that some process performed a scan in which both m1 and m2
had value (j, x′). By assumption, q performed a scan where m1 had value (r′ + 1, x′). So, some process
performed write-max(r′ + 1, x′) on m1. From the algorithm, this process performed a scan where m1 and
m2 both had value (r′, x′). Now suppose that 0 < j ≤ r′ and some process performed a scan in which both
m1 and m2 had value (j, x′). So, some process performed write-max(j, x′) on m1. From the algorithm, this
process performed a scan where m1 and m2 both had value (j − 1, x′).
Consider the smallest value of j such that (r, x) ≺ (j, x′). Note that (r, x) ≺ (r′, x′), so j ≤ r′. Hence,
some process performed a scan in which both m1 and m2 had value (j, x′). Since (r, x) ≺ (j, x′), this scan
occurred after the scan by p, in which m2 had value (r, x). But m1 had value (j, x′) in this scan and m1
had value (r + 1, x) in p’s scan, so (r + 1, x) ⪯ (j, x′). Since x ≠ x′, it follows that (r + 1, x) ≺ (j, x′). Hence
j ≥ 1 and (r, x) ≺ (j − 1, x′). This contradicts the choice of j.
5 Increment
Consider a system that supports only read(), write(x), and fetch-and-increment(). We prove that it is not
possible to solve n-consensus using a single memory location. We also consider a weaker system that supports
only read(), write(x), and increment() and provide an algorithm using O(log n) memory locations.
Theorem 5.1. It is not possible to solve obstruction-free binary consensus for n ≥ 2 processes using a single
memory location that supports only read(), write(x), and fetch-and-increment().
Proof. Suppose there is a binary consensus algorithm for two processes, p and q, using only one memory
location. Consider solo terminating executions α and β by p with input 0 and input 1, respectively. Let
α′ and β′ be the longest prefixes of α and β, respectively, that do not contain a write. Without loss of
generality, suppose that at least as many fetch-and-increment() instructions are performed in β′ as in α′.
Let C be the configuration that results from executing α′ starting from the initial configuration in which p
has input 0 and the other process, q has input 1.
Consider the shortest prefix β′′ of β′ in which p performs the same number of fetch-and-increment()
instructions as it performs in α′. Let C′ be the configuration that results from executing β′′ starting from
the initial configuration in which both p and q have input 1. Then q must decide 1 in its solo terminating
execution γ starting from configuration C′. However, C and C′ are indistinguishable to process q, so it
must decide 1 in γ starting from configuration C. If p has decided in configuration C, then it has decided
0, since q takes no steps in α′. Then both 0 and 1 are decided in execution α′γ starting from the initial
configuration in which p has input 0 and q has input 1. This violates agreement. Thus, p cannot have
decided in configuration C.
Therefore, p is poised to perform a write in configuration C. Let α′′ be the remainder of α, so α = α′α′′.
Since there is only one memory location, the configurations resulting from performing this write starting
from C and Cγ are indistinguishable to p. Thus, p also decides 0 starting from Cγ. But in this execution,
both 0 and 1 are decided, violating agreement.
The following well-known construction converts any algorithm for solving binary consensus to an algo-
rithm for solving n-valued consensus [HS12].
Lemma 5.2. Consider a system that supports a set of instructions that includes read() and write(x). If it
is possible to solve obstruction-free binary consensus among n processes using only c memory locations, then it
is possible to solve n-consensus using only (c+ 2) · ⌈log2 n⌉ − 2 locations.
Proof. The processes agree bit-by-bit in ⌈log2 n⌉ asynchronous rounds, each using c+2 locations. A process
starts in the first round with its input value as its value for round 1. In round i, if the i’th bit of its value
is 0, a process writes its value in a designated 0-location for the round. Otherwise, it writes its value in a
designated 1-location. Then, it performs the obstruction-free binary consensus algorithm using c locations
to agree on the i’th bit, vi, of the output. If this bit differs from the i’th bit of its value, the process reads
a recorded value from the designated vi-location for round i and adopts its value for the next round. Note
that some process must have already recorded a value to this location since, otherwise, the complementary bit v̄i would have
been agreed upon. This ensures that the values used for round i+1 are all input values and they all agree in
their first i bits. By the end, all processes have agreed on ⌈log2 n⌉ bits, i.e. on one of the at most n different
input values.
We can save two locations because the last round does not require designated 0 and 1-locations.
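The bit-by-bit reduction can be sketched as follows. This Python illustration makes several assumptions of our own: BinaryConsensus is a stand-in in which the first proposal wins and a proposal counts as a single atomic step (the lemma instead uses an obstruction-free algorithm on c locations), bits are processed from most significant to least significant, and, for simplicity, the designated 0- and 1-locations are kept in every round, including the last.

```python
import random


class BinaryConsensus:
    """Stand-in for a binary consensus algorithm: the first proposal wins."""

    def __init__(self):
        self.decided = None

    def propose(self, bit):
        if self.decided is None:
            self.decided = bit
        return self.decided


def process(pid, value, rounds, slots, cons, decisions):
    """Each resumption performs at most one shared-memory step."""
    for i in range(rounds):
        bit = (value >> (rounds - 1 - i)) & 1
        slots[i][bit] = value                # write(value) to the bit-location
        yield
        agreed = cons[i].propose(bit)        # agree on bit i of the output
        yield
        if agreed != bit:
            value = slots[i][agreed]         # read() and adopt a recorded value
            yield
    decisions[pid] = value


def run(inputs, seed):
    n = len(inputs)
    rounds = max(1, (n - 1).bit_length())    # equals ceil(log2 n) for n >= 2
    slots = [[None, None] for _ in range(rounds)]
    cons = [BinaryConsensus() for _ in range(rounds)]
    decisions = {}
    procs = {p: process(p, v, rounds, slots, cons, decisions)
             for p, v in enumerate(inputs)}
    rng = random.Random(seed)
    while procs:                             # arbitrary interleaving of steps
        p = rng.choice(sorted(procs))
        try:
            next(procs[p])
        except StopIteration:
            del procs[p]
    return decisions
```

A process that loses round i always finds a recorded value in the winning location, because the process whose proposal won wrote its value there before proposing.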
We can implement a 2-component unbounded counter, defined in Section 3, using two locations that
support read() and increment(). The values in the two locations never decrease. Therefore, as in the proof
of Theorem 4.2, a scan() operation that returns the values of both counters can be performed using the
double collect algorithm [AAD+93]. By Lemma 3.1, n processes can solve obstruction-free binary consensus
using a 2-component unbounded counter. The next result then follows from Lemma 5.2.
Theorem 5.3. It is possible to solve n-consensus using only O(log n) memory locations that support only
read(), write(x), and increment().
6 Buffers
In this section, we consider the instructions ℓ-buffer-read() and ℓ-buffer-write(x), for ℓ ≥ 1, which generalize
read and write, respectively. Specifically, an ℓ-buffer-read instruction returns the sequence of inputs to the
ℓ most recent ℓ-buffer-write instructions applied to the memory location, in order from least recent to most
recent. If the number of ℓ-buffer-write instructions previously applied to the memory location is ℓ′ < ℓ, then
the first ℓ − ℓ′ elements of this sequence are ⊥. Subsequent to the conference version of this paper [EGSZ16],
Mostéfaoui, Perrin, and Raynal [MPR18] defined a k-sliding window register, which is an object that supports
only k-buffer-read and k-buffer-write instructions.
We consider a system that supports the instruction set Bℓ = {ℓ-buffer-read(), ℓ-buffer-write(x)}, for some
ℓ ≥ 1. We call each memory location in such a system an ℓ-buffer and say that each memory location has
capacity ℓ. Note that a 1-buffer is simply a register. For ℓ > 1, an ℓ-buffer essentially maintains a buffer of
the ℓ most recent writes to that location and allows them to be read.
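The semantics of an ℓ-buffer can be made concrete with a short sketch. The Python class below is only a sequential model with names of our choosing: None stands in for ⊥, and the atomicity of each instruction is assumed rather than enforced.

```python
from collections import deque

BOTTOM = None                                # stands in for ⊥


class LBuffer:
    """Sequential model of a memory location with capacity ell."""

    def __init__(self, ell):
        self.ell = ell
        self._recent = deque(maxlen=ell)     # keeps only the ell newest writes

    def buffer_write(self, x):
        self._recent.append(x)

    def buffer_read(self):
        # Least recent first, padded on the left with ⊥ until ell writes occurred.
        pad = [BOTTOM] * (self.ell - len(self._recent))
        return pad + list(self._recent)
```

With ell = 1, buffer_read() returns a one-element sequence holding the last value written, which is the behaviour of an ordinary register.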
In Section 6.1, we show that a single ℓ-buffer can be used to simulate a powerful history object that can
be updated by at most ℓ processes. This will allow us to simulate an obstruction-free variant of Aspnes
and Herlihy’s algorithm for n-consensus [AH90] and, hence, solve n-consensus, using only ⌈n/ℓ⌉ ℓ-buffers.
In Section 6.2, we prove that ⌈(n− 1)/ℓ⌉ ℓ-buffers are necessary, which matches the upper bound whenever
n− 1 is not a multiple of ℓ.
6.1 Simulations Using Buffers
A history object, H , supports two operations, get-history() and append(x), where get-history() returns the
sequence of all values appended to H by prior append operations, in order. We first show that, using a
single ℓ-buffer, B, we can simulate a history object, H , that supports arbitrarily many readers and at most
ℓ different appenders.
Lemma 6.1. A single ℓ-buffer can simulate a history object on which at most ℓ different processes can
perform append() and any number of processes can perform get-history().
Proof. Without loss of generality, assume that no value is appended to H more than once. This can be
achieved by having a process include its process identifier and a sequence number along with the value that
it wants to append.
In our implementation, B is initially ⊥ and each value written to B is of the form (h, x), where h is a
history of appended values and x is a single appended value.
To implement append(x) on H , a process obtains a history, h, by performing get-history() on H and
then performs ℓ-buffer-write(h, x) on B. The operation is linearized at this ℓ-buffer-write step.
To implement get-history() on H , a process simply performs an ℓ-buffer-read of B to obtain a vector
(a1, . . . , aℓ), where aℓ is the most recently written value. The operation is linearized at this ℓ-buffer-read.
We describe below how the return value of the get-history() operation is computed.
We prove that each get-history() operation, G, on H returns the sequence of inputs to all append
operations on H that were linearized before it, in order from least recent to most recent. Let R be the
ℓ-buffer-read step performed by G and let (a1, . . . , aℓ) be the vector returned by R.
Note that (a1, . . . , aℓ) = (⊥, . . . ,⊥) if and only if no ℓ-buffer-write steps were performed before R, i.e., if
and only if no append operations are linearized before G. In this case, the empty sequence is returned by
the get-history() operation, as required.
Now suppose that k ≥ 1 ℓ-buffer-write steps were performed on B before R, i.e., k append operations were
linearized before G. Inductively assume that each get-history() operation which has fewer than k append
operations linearized before it returns the sequence of inputs to those append operations.
If ai ≠ ⊥, then ai = (hi, xi) was the input to an ℓ-buffer-write step Wi on B performed before R. Consider
the append operation Ai that performed step Wi. It appended the value xi to H and the get-history()
operation, Gi, that Ai performed returned the history hi of appended values. Let Ri be the ℓ-buffer-read
step performed by Gi. Since Ri occurred before Wi, which occurred before R, fewer than k ℓ-buffer-write
steps occurred before Ri. Hence, fewer than k append operations are linearized before Gi. By the induction
hypothesis, hi is the sequence of inputs to the append operations linearized before Gi.
If k < ℓ, then a1 = · · · = aℓ−k = ⊥. In this case, G returns the sequence xℓ−k+1, . . . , xℓ. Since
each append operation is linearized at its ℓ-buffer-write step and xℓ−k+1, . . . , xℓ are the inputs to these k
append operations, in order from least recent to most recent, G returns the sequence of inputs to the append
operations linearized before it.
So, suppose that k ≥ ℓ. Let h = hm be the longest history amongst h1, . . . , hℓ. If h contains x1, then
G returns h′ · (x1, . . . , xℓ), where h′ is the prefix of h up to, but not including, x1. By definition, a1, . . . , aℓ
are the inputs to the ℓ most recent ℓ-buffer-write operations prior to R, so x1, . . . , xℓ are the last ℓ values appended
to H prior to G. Since h contains x1, it also contains all values appended to H prior to x1. It follows that
h′ · (x1, . . . , xℓ) is the sequence of inputs to the append operations linearized before G.
[Figure 1 (diagram omitted): timeline of ℓ concurrent append(xi) operations A1, . . . , Aℓ. Each Ai performs an ℓ-buffer-read Ri, which returns hi, followed by an ℓ-buffer-write Wi of ai.]
Figure 1: When h does not contain x1, there are ℓ concurrent append operations.
Now suppose that h does not contain x1. Then none of h1, . . . , hℓ contain x1. Hence G1, . . . , Gℓ were
linearized before A1 and R1, . . . , Rℓ were performed prior to W1. Since step W1 occurred before W2, . . . , Wℓ,
the operations A1, . . . , Aℓ are all concurrent with one another. This is illustrated in Figure 1. Therefore
A1, . . . , Aℓ are performed by different processes. Only ℓ different processes can perform append operations
on H , so no other append operations on H are linearized between Rm and W1. Therefore, h contains all
values appended to H prior to x1. It follows that h · (x1, . . . , xℓ) is the sequence of inputs to the append
operations linearized before G.
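The construction in this proof can be sketched in Python as follows. This is a minimal, single-threaded model rather than a concurrent implementation: the class names LBuffer and History are ours, ⊥ is represented by None, histories are stored as tuples, and appended values are assumed to be distinct, as in the proof.

```python
from collections import deque

BOTTOM = None  # stands for the initial value ⊥ (representation is an assumption)

class LBuffer:
    """Models an l-buffer: it remembers the l most recent writes."""
    def __init__(self, l):
        self.l = l
        self.slots = deque([BOTTOM] * l, maxlen=l)

    def write(self, value):       # models l-buffer-write
        self.slots.append(value)  # the oldest entry falls off the left end

    def read(self):               # models l-buffer-read: (a1, ..., al), al most recent
        return tuple(self.slots)

class History:
    """History object simulated from a single l-buffer, following the proof."""
    def __init__(self, l):
        self.buffer = LBuffer(l)

    def get_history(self):
        entries = [a for a in self.buffer.read() if a is not BOTTOM]
        if not entries:                    # no append linearized yet
            return ()
        xs = tuple(x for _, x in entries)
        if len(entries) < self.buffer.l:   # case k < l: the buffer holds every append
            return xs
        h = max((h for h, _ in entries), key=len)  # longest recorded history
        if xs[0] in h:                     # case: h contains x1
            return h[:h.index(xs[0])] + xs
        return h + xs                      # case: l concurrent appends

    def append(self, x):
        h = self.get_history()             # first step: read
        self.buffer.write((h, x))          # second step: write (linearization point)
```

The two steps of append are deliberately separate calls, so an interleaving such as the one in Figure 1 can be reproduced by calling get_history() for several appenders before any of them calls buffer.write.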
This lemma allows us to simulate any object that supports at most ℓ updating processes using only a
single ℓ-buffer. This is because the state of an object is determined by the history of the non-trivial operations
performed on it. In particular, we can simulate an array of ℓ single-writer registers using a single ℓ-buffer.
Lemma 6.2. A single ℓ-buffer can simulate ℓ single-writer registers.
Proof. Suppose that register Ri is owned by process pi, for 1 ≤ i ≤ ℓ. By Lemma 6.1, it is possible to
simulate a history object H that can be updated by ℓ processes and read by any number of processes. To
write value x to Ri, process pi appends (i, x) to H . To read Ri, a process reads H and finds the value of the
most recent write to Ri. This is the second component of the last pair in the history whose first component
is i.
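The read rule in this proof amounts to a backward scan of the history. A minimal Python sketch, where the function name and the `initial` default for a register that has never been written are our assumptions:

```python
def read_register(history, i, initial=None):
    """Return the value of register R_i given a history of (owner, value) pairs.

    The history is ordered from least recent to most recent, so the most
    recent write to R_i is the last pair whose first component is i.
    `initial` is a hypothetical default for a never-written register.
    """
    for owner, value in reversed(history):
        if owner == i:
            return value
    return initial
```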
Thus, we can use ⌈n/ℓ⌉ ℓ-buffers to simulate n single-writer registers. An n-component unbounded counter
shared by n processes can be implemented in an obstruction-free way from n single-writer registers. Each
process records the number of times it has incremented each component in its single-writer register. An
obstruction-free scan() can be performed using the double collect algorithm [AAD+93] and summing. Hence,
by Lemma 3.1 we get the following result.
Theorem 6.3. It is possible to solve n-consensus using only ⌈n/ℓ⌉ ℓ-buffers.
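The counter construction used to obtain this result can be sketched as follows. This is a single-threaded Python model under stated assumptions: the class name SharedCounter is ours, each register is modelled as a list slot, and each register carries a sequence number so that two identical collects imply that no write occurred in between.

```python
class SharedCounter:
    """n-component counter built from n single-writer registers (sketch)."""
    def __init__(self, n):
        self.n = n
        # Register i holds (sequence number, per-component increment counts of process i).
        self.regs = [(0, (0,) * n) for _ in range(n)]

    def increment(self, pid, component):
        """Process pid records one more increment of the given component."""
        seq, counts = self.regs[pid]
        counts = list(counts)
        counts[component] += 1
        self.regs[pid] = (seq + 1, tuple(counts))

    def collect(self):
        """Read all registers one after another."""
        return tuple(self.regs)

    def scan(self):
        """Double collect: repeat until two consecutive collects agree, then sum."""
        prev = self.collect()
        while True:
            cur = self.collect()
            if cur == prev:
                return tuple(sum(c[j] for _, c in cur) for j in range(self.n))
            prev = cur
```

In a run without interference the loop stops after two collects, which is why the scan is obstruction-free rather than wait-free.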
6.2 A Lower Bound
In this section, we prove a lower bound on the number of memory locations (supporting ℓ-buffer-read() and
ℓ-buffer-write(x)) necessary for solving obstruction-free binary consensus among n ≥ 2 processes.
In any configuration, location r is covered by process p if p is poised to perform ℓ-buffer-write on r. A
location is k-covered by a set of processes P in a configuration if there are exactly k processes in P that cover
it. A configuration is at most k-covered by P , if every process in P covers some location and no location is
k′-covered by P , for any k′ > k.
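These covering definitions can be made concrete with a short Python sketch. The representation is an assumption made for illustration: `poised` maps each process to the location on which it is poised to perform ℓ-buffer-write, or to None if it is not poised to write.

```python
from collections import Counter

def cover_counts(poised, P):
    """Number of processes in P covering each location."""
    return Counter(poised[p] for p in P if poised.get(p) is not None)

def is_k_covered(poised, P, r, k):
    """Location r is k-covered by P if exactly k processes in P cover it."""
    return cover_counts(poised, P)[r] == k

def is_at_most_k_covered(poised, P, k):
    """Every process in P covers some location and no location is k'-covered for k' > k."""
    every_process_covers = all(poised.get(p) is not None for p in P)
    return every_process_covers and all(c <= k for c in cover_counts(poised, P).values())
```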
Let C be a configuration and let Q be a set of processes, each of which is poised to perform ℓ-buffer-write
in C. A block write by Q from C is an execution, starting from C, in which each process in Q takes exactly
one step. If a block write is performed that includes ℓ different ℓ-buffer-write instructions to the same
location, and then some process performs ℓ-buffer-read on that location, the process gets the same result
regardless of the value of that location in C.
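The obliterating effect of a block write can be seen in a small Python model of one ℓ-buffer. The helper name is ours, and the buffer is modelled by a bounded deque.

```python
from collections import deque

def read_after_block_write(initial_contents, writes, l):
    """Apply the given writes to an l-buffer and return what a subsequent read sees."""
    buf = deque(initial_contents, maxlen=l)  # keeps only the l most recent entries
    for w in writes:
        buf.append(w)
    return tuple(buf)
```

With ℓ = 3, three writes yield the same read result from any starting contents, whereas two writes still expose one earlier value.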
We say that a set of processes P can decide v ∈ {0, 1} from a configuration C if there exists a P-only
execution from C in which v is decided. If P can decide both 0 and 1 from C, then P is bivalent from C.
To obtain the lower bound, we extend the proof of the n − 1 lower bound on the number of registers
required for solving n-process consensus [Zhu16]. We also borrow intuition about reserving executions
from the Ω(n) lower bound for anonymous consensus [Gel15]. The following auxiliary lemmas are largely
unchanged from [Zhu16]. The main difference is that we only perform block writes on ℓ-buffers that are
ℓ-covered by P .
Lemma 6.4. There is an initial configuration from which the set of all processes in the system is bivalent.
Proof. Consider an initial configuration, I, with two processes p0 and p1, such that pv starts with input v,
for v ∈ {0, 1}. Observe that {pv} can decide v from I since, initially, I is indistinguishable to pv from the
configuration where every process starts with input v. Thus, {p0, p1} is bivalent from I and, therefore, so is
the set of all processes.
Lemma 6.5. Let C be a configuration and Q be a set of processes that is bivalent from C. Suppose C is at
most ℓ-covered by a set of processes R, where R ∩ Q = ∅. Let L be a set of locations that are ℓ-covered by
R in C. Let β be a block write from C by the set of ℓ · |L| processes in R that cover L. Then there exists a
Q-only execution ξ from C such that R∪Q is bivalent from Cξβ and, in configuration Cξ, some process in
Q covers a location not in L.
Proof. Suppose some process p ∈ R can decide some value v ∈ {0, 1} from configuration Cβ and ζ is a
Q-only execution from C in which v̄ is decided. Let ξ be the longest prefix of ζ such that p can decide v
from Cξβ. Let δ be the next step by q ∈ Q in ζ after ξ.
If δ is an ℓ-buffer-read or is an ℓ-buffer-write to a location in L, then Cξβ and Cξδβ are indistinguishable
to p. Since p can decide v from Cξβ, but p can only decide v̄ from Cξδβ, δ must be an ℓ-buffer-write to a
location not in L. Thus, in configuration Cξ, q covers a location not in L and Cξβδ is indistinguishable
from Cξδβ to process p. Therefore, by definition of ξ, p can only decide v̄ from Cξβδ and p can decide v
from Cξβ. This implies that {p, q} and, hence, R ∪ Q is bivalent from Cξβ.
The next result says that if a set of processes is bivalent in some configuration, then it is possible to
reach a configuration from which 0 and 1 can be decided in solo executions. It does not depend on what
instructions are supported by the memory.
Lemma 6.6. Suppose U is a set of at least two processes that is bivalent from configuration C. Then it
is possible to reach, via a U-only execution from C, a configuration, C′, such that, for i = 0, 1, there is a
process qi ∈ U that decides i from C′.
Proof. Let D be the set of all configurations from which U is bivalent and which are reachable from C by
a U-only execution. Let k be the smallest integer such that there exist a configuration C′′ ∈ D and a set
U ′ ⊆ U of k processes that is bivalent from C′′. Pick any such C′′ ∈ D and let U ′ ⊆ U be a set of k processes
that is bivalent from C′′. Since each process p has only one terminating solo execution from C′′ and it
decides only one value in this execution, it follows that k ≥ 2.
Consider a process p ∈ U ′ and let U ′′ = U ′−{p} be the set of remaining processes in U ′. Since |U ′′| = k−1,
there exists v ∈ {0, 1} such that U ′′ can only decide v from C′′. Let q ∈ U ′′. Then q decides v from C′′. If p
decides v̄ from C′′, then p and q satisfy the claim for C′ = C′′. So, suppose that p decides v from C′′.
Since U ′ is bivalent from C′′, there is a U ′-only execution α from C′′ that decides v̄. Let α′ be the longest
prefix of α such that both p and U ′′ can only decide v from C′′α′. Note that α′ ≠ α, because v̄ is decided
in α. Let δ be the next step in α after α′. Then either p or U ′′ can decide v̄ from C′′α′δ.
First, suppose that δ is a step by a process in U ′′. Since U ′′ can only decide v from C′′α′, U ′′ can only
decide v from C′′α′δ. Therefore, p decides v̄ from C′′α′δ. Since q ∈ U ′′ decides v from C′′α′δ, p and q satisfy
the claim for C′ = C′′α′δ.
Finally, suppose that δ is a step by p. Since p decides v from C′′α′, p decides v from C′′α′δ. Therefore,
U ′′ can decide v̄ from C′′α′δ. However, |U ′′| = k − 1. By definition of k, U ′′ is not bivalent from C′′α′δ.
Therefore U ′′ can only decide v̄ from C′′α′δ. Since q ∈ U ′′ decides v̄ from C′′α′δ, p and q satisfy the claim
for C′ = C′′α′δ.
Similar to the induction used by Zhu [Zhu16], from a configuration that is at most ℓ-covered by a set
of processes R, we show how to reach another configuration that is at most ℓ-covered by R and in which
another process z ∉ R covers a location that is not ℓ-covered by R.
Lemma 6.7. Let C be a configuration and let P be a set of n ≥ 2 processes. If P is bivalent from C, then
there is a P-only execution α starting from C and a set Q ⊆ P of two processes such that Q is bivalent from
Cα and Cα is at most ℓ-covered by the remaining processes P −Q.
Proof. By induction on |P|. The base case is when |P| = 2. Let Q = P and let α be the empty execution.
Since P −Q = ∅, the claim holds.
Now let |P| > 2 and suppose the claim holds for |P|−1. By Lemma 6.6, there exist a P-only execution γ
starting from C and a set Q ⊂ P of two processes that is bivalent from D = Cγ. Pick any process z ∈ P−Q.
Then P − {z} is bivalent from D because Q is bivalent from D.
We construct a sequence of configurations D0, D1, . . . reachable from D such that, for all i ≥ 0, the
following properties hold:
1. there exists a set of two processes Qi ⊆ P − {z} such that Qi is bivalent from Di,
2. Di is at most ℓ-covered by the remaining processes Ri = (P − {z})−Qi, and
3. if Li is the set of locations that are ℓ-covered by Ri in Di, then Di+1 is reachable from Di by a
(P − {z})-only execution αi which contains a block write βi to Li by ℓ · |Li| processes in Ri.
By the induction hypothesis applied to D and P − {z}, there is a (P − {z})-only execution η starting
from D and a set Q0 ⊆ (P −{z}) of two processes such that Q0 is bivalent from D0 = Dη and D0 is at most
ℓ-covered by R0 = (P − {z})−Q0.
Now suppose that Di is a configuration reachable from D and Qi and Ri are sets of processes that satisfy
all three conditions.
By Lemma 6.5 applied to configuration Di, there is a Qi-only execution ξi such that Ri ∪Qi = P − {z}
is bivalent from Diξiβi, where βi is a block write to Li by ℓ · |Li| processes in Ri. Applying the induction
hypothesis to Diξiβi and P − {z}, we get a (P − {z})-only execution ψi leading to a configuration Di+1 =
Diξiβiψi, in which there is a set, Qi+1, of two processes such that Qi+1 is bivalent from Di+1. Additionally,
Di+1 is at most ℓ-covered by the set of remaining processesRi+1 = (P−{z})−Qi+1. Note that the execution
αi = ξiβiψi contains the block write βi to Li by ℓ · |Li| processes in Ri.
Since there are only finitely many locations, there exist indices 0 ≤ i < j such that Li = Lj. Next, we insert
steps of z that cannot be detected by any process in P − {z}. Consider any {z}-only execution ζ from Diξi
that decides a value v ∈ {0, 1}. If ζ does not contain any ℓ-buffer-write to locations outside Li, then Diξiζβi
is indistinguishable from Diξiβi to processes in P − {z}. Since Diξiβi is bivalent for P − {z}, there exists a
(P − {z})-only execution from Diξiβi and, hence, from Diξiζβi that decides v̄, contradicting agreement. Thus ζ
contains an ℓ-buffer-write to a location outside Li. Let ζ′ be the longest prefix of ζ that does not contain an
ℓ-buffer-write to a location outside Li. Then, in Diξiζ′, z is poised to perform an ℓ-buffer-write to a location
outside Li = Lj.
Diξiζ′βi is indistinguishable from Diξiβi to P − {z}, so the (P − {z})-only execution ψiαi+1 · · ·αj−1 can
be applied from Diξiζ′βi. Let α = γηα0 · · ·αi−1ξiζ′βiψiαi+1 · · ·αj−1. Every process in P − {z} is in the
same state in Cα as it is in Dj. In particular, Qj ⊆ P − {z} is bivalent from Dj and, hence, from Cα. Every
location is at most ℓ-covered by Rj = (P − {z}) − Qj in Dj and, hence, in Cα. Moreover, since z takes no
steps after Diξiζ′, z covers a location not in Lj in configurations Dj and Cα. Therefore, every location is
at most ℓ-covered by Rj ∪ {z} = P − Qj in Cα.
Finally, we can prove the main theorem.
Theorem 6.8. Consider a memory consisting of ℓ-buffers. Then any obstruction-free binary consensus
algorithm for n processes uses at least ⌈(n− 1)/ℓ⌉ locations.
Proof. Consider any obstruction-free binary consensus algorithm for n processes. By Lemma 6.4, there exists
an initial configuration from which the set of all n processes, P , is bivalent. Lemma 6.7 implies that there
is a configuration, C, reachable from this initial configuration and a set Q ⊆ P , of two processes such that
Q is bivalent from C and C is at most ℓ-covered by the remaining processes R = P −Q. By the pigeonhole
principle, R covers at least ⌈(n− 2)/ℓ⌉ ≥ ⌈(n− 1)/ℓ⌉ − 1 different locations.
Suppose that R covers exactly ⌈(n− 2)/ℓ⌉ different locations and ⌈(n− 2)/ℓ⌉ < ⌈(n− 1)/ℓ⌉. Then n− 2
is a multiple of ℓ and every location covered by R is, in fact, ℓ-covered by R. Since Q is bivalent from C,
Lemma 6.5 implies that there is a Q-only execution ξ such that some process in Q covers a location that is
not covered by R. Hence, there are at least ⌈(n− 2)/ℓ⌉+ 1 = ⌈(n− 1)/ℓ⌉ locations.
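The arithmetic behind the two steps of this proof can be checked with a short Python sketch. The function names are ours and are used only for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, for a >= 0 and b >= 1."""
    return -(-a // b)

def min_locations_covered(num_processes, l):
    """Pigeonhole: if at most l processes cover each location, this many locations are covered."""
    return ceil_div(num_processes, l)

def needs_extra_location(n, l):
    """True when ceil((n - 2) / l) falls short of ceil((n - 1) / l)."""
    return ceil_div(n - 2, l) < ceil_div(n - 1, l)
```

The shortfall occurs exactly when n − 2 is a multiple of ℓ, and in that case adding one location closes the gap.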
The lower bound in Theorem 6.8 can be extended to a heterogeneous setting, where the capacities of
different memory locations are not necessarily the same. To do so, we extend the definition of a configuration
C being at most ℓ-covered by a set of processes P . Instead, we require that the number of processes in P
covering each location is at most its capacity. Then we consider block writes to a set of locations containing
ℓ different ℓ-buffer-write operations to each ℓ-buffer in the set. The general result is that, for any algorithm
which solves consensus for n processes and satisfies nondeterministic solo termination, the sum of capacities
of all buffers must be at least n− 1.
The lower bound also applies to systems in which the return value of every non-trivial instruction on a
memory location does not depend on the value of that location and the return value of any trivial instruction
is a function of the sequence of the preceding ℓ non-trivial instructions performed on the location. This is
because such instructions can be implemented by ℓ-buffer-read and ℓ-buffer-write instructions. We record
each invocation of a non-trivial instruction using ℓ-buffer-write. The return values of these instructions can
be determined locally. To implement a trivial instruction, we perform ℓ-buffer-read , which returns a sequence
containing the description of the last ℓ non-trivial instructions performed on the location. This is sufficient
to determine the correct return value.
7 Multiple Assignment
With m-register multiple assignment, we can atomically write to m locations. This instruction plays an
important role in the Consensus Hierarchy [Her91], as m-register multiple assignment can be used to solve
wait-free consensus for 2m− 2 processes, but not for 2m− 1 processes.
In this section, we explore whether multiple assignment could improve the space complexity of solving
obstruction-free consensus. A practical motivation for this question is that obstruction-free multiple
assignment can be easily implemented using a simple transaction.
We prove a lower bound that is similar to the lower bound in Section 6.2. Suppose ℓ-buffer-read()
and ℓ-buffer-write(x) instructions are supported on every memory location in a system and, for any subset
of locations, a process can atomically perform one ℓ-buffer-write instruction per location. Then ⌈(n− 1)/2ℓ⌉
locations are necessary for n processes to solve binary consensus. As in Section 6.2, this result can be further
generalized to a heterogeneous setting and different sets of instructions.
The main technical difficulty is proving an analogue of Lemma 6.5. In the absence of multiple assignment,
if β is a block write to a set of ℓ-covered locations, L, and δ is an ℓ-buffer-write to a location not in L, then β
and δ commute (in the sense that the configurations resulting from performing βδ and δβ are indistinguishable
to all processes). However, a multiple assignment δ can atomically perform ℓ-buffer-write to many locations,
including locations in L. Thus, it may be possible for processes to distinguish between βδ and δβ. Using a
careful combinatorial argument, we construct two blocks of multiple assignments, β1 and β2, such that, in
each block, ℓ-buffer-write is performed at least ℓ times on each location in L and is not performed on any
location outside of L. Given this, we can show that β1δβ2 and δβ1β2 are indistinguishable to all processes.
This is enough to prove an analogue of Lemma 6.5.
First, we define a notion of covering for this setting. In configuration C, process p covers location r if
p is poised to perform a multiple assignment that includes an ℓ-buffer-write to r. The next definition is
key to our proof. Suppose that, in some configuration C, each process in P is poised to perform a multiple
assignment. A k-packing of P in C is a function π mapping each process in P to some memory location it
covers such that no location r has more than k processes mapped to it (i.e., |π−1(r)| ≤ k). When π(p) = r
we say that π packs p in location r. A k-packing may not always exist or there may be many k-packings,
depending on the configuration, the set of processes, and the value of k. A location r is fully k-packed by P
in configuration C, if there is a k-packing of P in C and all k-packings of P in C pack exactly k processes
in r.
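For small instances, these definitions can be explored by brute force. In the following Python sketch, `covers` is an assumed representation that maps each process to the set of locations it covers, and the function names are ours.

```python
from itertools import product
from collections import Counter

def k_packings(covers, k):
    """Yield every k-packing as a dict from process to the location it is packed in."""
    procs = sorted(covers)
    for choice in product(*(sorted(covers[p]) for p in procs)):
        # A valid k-packing maps at most k processes to any one location.
        if all(c <= k for c in Counter(choice).values()):
            yield dict(zip(procs, choice))

def fully_k_packed(covers, k):
    """Locations in which every k-packing packs exactly k processes."""
    packings = list(k_packings(covers, k))
    if not packings:          # no k-packing exists, so no location is fully k-packed
        return set()
    locations = set().union(*covers.values())
    return {r for r in locations
            if all(Counter(pi.values())[r] == k for pi in packings)}
```

The enumeration is exponential in the number of processes, so it is meant only for checking examples by hand.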
Suppose that, in some configuration, there are two k-packings of the same set of processes, but the first
packs more processes in some location r than the second. We show there is a location r′ in which the
first packing packs fewer processes than the second and there is a k-packing which, as compared to the
first packing, packs one fewer process in location r, one more process in location r′, and the same number of
processes in all other locations. The proof relies on the existence of a certain Eulerian path in a multigraph that
we build to represent these two k-packings.
Lemma 7.1. Suppose g and h are two k-packings of the same set of processes P in some configuration C
and r1 is a location such that |g−1(r1)| > |h−1(r1)| (i.e., g packs more processes in r1 than h does). Then,
there exists a sequence of locations, r1, r2, . . . , rt, and a sequence of distinct processes, p1, p2, . . . , pt−1, such
that |h−1(rt)| > |g−1(rt)| (i.e., h packs more processes in rt than g), and g(pi) = ri and h(pi) = ri+1 for
1 ≤ i ≤ t− 1. Moreover, for 1 ≤ j < t, there exists a k-packing g′ such that g′ packs one less process than
g in rj , g′ packs one more process than g in rt, g′ packs the same number of processes as g in all other
locations, and g′(q) = g(q) for all q ∉ {pj , . . . , pt−1}.
Proof. Consider a multigraph with one node for each memory location in the system and one directed edge
from node g(p) to node h(p) labelled by p, for each process p ∈ P . The in-degree of any node v is |h−1(v)|,
which is the number of processes that are packed into memory location v by h, and the out-degree of node
v is |g−1(v)|, which is the number of processes that are packed in v by g.
Now, consider any maximal Eulerian path in this multigraph starting from the node r1. This path consists
of a sequence of distinct edges, but may visit the same node multiple times. Let r1, . . . , rt be the sequence
of nodes visited and let pi be the labels of the traversed edges, in order. Then g(pi) = ri and h(pi) = ri+1
for 1 ≤ i ≤ t − 1. The edges in the path are all different and each is labelled by a different process, so the
path has length at most |P|. By maximality, the last node in the sequence must have more incoming edges
than outgoing edges, so |h−1(rt)| > |g−1(rt)|.
Let 1 ≤ j < t. We construct g′ from g by re-packing each process pi from ri to ri+1 for all j ≤ i < t.
Then g′(pi) = ri+1 for j ≤ i < t and g′(p) = g(p) for all other processes p. Notice that pi covers ri+1, since
h(pi) = ri+1 and h is a k-packing. As compared to g, g′ packs one fewer process in rj , one more process in
rt, and the same number of processes in every other location. Since h is a k-packing, it packs at most k
processes in rt. Because g is a k-packing that packs fewer processes in rt than h, g′ is also a k-packing.
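The construction in this proof can be sketched in Python. The packings g and h are assumed to be dicts from processes to locations, the function names are ours, and the greedy walk below is just one way to obtain a maximal path with distinct edges. Note that j is 0-indexed here, whereas the lemma uses 1 ≤ j < t.

```python
def maximal_trail(g, h, r1):
    """Follow unused edges g(p) -> h(p), starting at r1, until no edge leaves the current node."""
    unused = set(g)
    nodes, procs = [r1], []
    while True:
        out = sorted(p for p in unused if g[p] == nodes[-1])
        if not out:               # stuck: the trail is maximal
            return nodes, procs
        p = out[0]                # deterministic choice among the outgoing edges
        unused.remove(p)
        procs.append(p)
        nodes.append(h[p])

def repack(g, h, r1, j=0):
    """Return (g', nodes), where g' moves the trail's processes from index j on to their h-location."""
    nodes, procs = maximal_trail(g, h, r1)
    g2 = dict(g)
    for p in procs[j:]:
        g2[p] = h[p]
    return g2, nodes
```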
Let P be a set of processes, each of which is poised to perform a multiple assignment in some configuration
C. A block multi-assignment by P from C is an execution starting at C, in which each process in P takes
exactly one step.
Consider some configuration C and a set of processes R such that there is a 2ℓ-packing π of R in C.
Let L be the set of all locations that are fully 2ℓ-packed by R in C, so π packs exactly 2ℓ processes from
R in each location r ∈ L. Partition the 2ℓ · |L| processes packed by π in L into two sets, R1 and R2, each
containing ℓ · |L| processes, such that, for each location r ∈ L, ℓ of the processes packed in r by π belong to
R1 and the other ℓ belong to R2. For i ∈ {1, 2}, let βi be a block multi-assignment by Ri.
Notice that, for any location r ∈ L, the outcome of any ℓ-buffer-read on r after βi does not depend on
multiple assignments that occurred prior to βi. Moreover, we can prove the following crucial property about
these block multi-assignments to fully packed locations.
Lemma 7.2. Neither β1 nor β2 involves an ℓ-buffer-write to a location outside of L.
Proof. Assume the contrary. Let q ∈ R1 ∪ R2 be a process with π(q) ∈ L such that, in C, q also covers
some location r1 ∉ L. If |π−1(r1)| < 2ℓ, then there is another 2ℓ-packing of R in C, which is the same as π,
except that it packs q in location r1 instead of π(q). However, this packing packs fewer than 2ℓ processes in
π(q) ∈ L, contradicting the definition of L. Therefore |π−1(r1)| = 2ℓ, i.e., π packs exactly 2ℓ processes in r1.
Since L is the set of all fully 2ℓ-packed locations, there exists a 2ℓ-packing h, which packs strictly fewer
than 2ℓ processes in r1 ∉ L. From Lemma 7.1 with g = π and k = 2ℓ, there is a sequence of locations,
r1, . . . , rt, and a sequence of processes, p1, . . . , pt−1, such that |h−1(rt)| > |π−1(rt)|. Since h is a 2ℓ-packing,
it packs at most 2ℓ processes in rt and, hence, π packs strictly fewer than 2ℓ processes in rt. Thus, rt ∉ L.
We consider two cases.
First, suppose that q ≠ pi for all i = 1, . . . , t− 1, i.e., q does not occur in the sequence p1, . . . , pt−1. By
the second part of Lemma 7.1 with j = 1, there is a 2ℓ-packing π′ that packs fewer than 2ℓ processes in r1,
one more process than π in rt, and the same number of processes as π in all other locations. In particular,
π′ packs exactly 2ℓ processes in each location in L, including π(q). Moreover, π′(q) = π(q), since q does not
occur in the sequence p1, . . . , pt−1. Consider another 2ℓ-packing of R in C, which is the same as π′, except
that it packs q in location r1 instead of location π(q). However, this packing packs fewer than 2ℓ processes
in π(q) ∈ L, contradicting the definition of L.
Now, suppose that q = ps, for some s ∈ {1, . . . , t − 1}. Since rs = π(ps) = π(q) ∈ L, it follows that
|π−1(rs)| = 2ℓ. By the second part of Lemma 7.1 with j = s, there is a 2ℓ-packing that packs fewer than 2ℓ
processes in rs, one more process than π in rt, and the same number of processes as π in all other locations.
Since rs ∈ L, this contradicts the definition of L.
Thus, in configuration C, every process in R1 ∪R2 only covers locations in L.
We can now prove a lemma that replaces Lemma 6.5 in the main argument.
Lemma 7.3. Let Q be a set of processes disjoint from R that is bivalent from C. Then there exists a Q-only
execution ξ from C such that R ∪ Q is bivalent from Cξβ1 and, in configuration Cξ, some process in Q
covers a location not in L.
Proof. Suppose some process p ∈ R can decide some value v ∈ {0, 1} from configuration Cβ1β2 and ζ is a
Q-only execution from C in which v̄ is decided. Let ξ be the longest prefix of ζ such that p can decide v
from Cξβ1β2. Let δ be the next step by q ∈ Q in ζ after ξ.
If δ is an ℓ-buffer-read or a multiple assignment involving only ℓ-buffer-write operations to locations in
L, then Cξβ1β2 and Cξδβ1β2 are indistinguishable to p. Since p can decide v from Cξβ1β2, but p can only decide
v̄ from Cξδβ1β2, δ must be a multiple assignment that includes an ℓ-buffer-write to a location not in L.
Thus, in configuration Cξ, q covers a location not in L. For each location r ∈ L, the value of r is the same in
Cξδβ1β2 as it is in Cξβ1δβ2 due to the block multi-assignment β2. By Lemma 7.2, for each location r ∉ L,
neither β1 nor β2 performs an ℓ-buffer-write to r, so the value of r is the same in Cξδβ1β2 as it is in Cξβ1δβ2.
Since the state of process p is the same in configurations Cξβ1δβ2 and Cξδβ1β2, these two configurations are
indistinguishable to p.
Therefore, by definition of ξ, p can only decide v̄ from Cξβ1δβ2 and p can decide v from Cξβ1β2. This
implies that R ∪ Q is bivalent from Cξβ1.
Using these tools, we can prove the following analogue of Lemma 6.7:
Lemma 7.4. Let C be a configuration and let P be a set of n ≥ 2 processes. If P is bivalent from C, then
there is a P-only execution α and a set Q ⊆ P of at most two processes such that Q is bivalent from Cα
and there exists a 2ℓ-packing π of the remaining processes P −Q in Cα.
Proof. By induction on |P|. The base case is when |P| = 2. Let Q = P and let α be the empty execution.
Since P −Q = ∅, the claim holds.
Now let |P| > 2 and suppose the claim holds for |P|−1. By Lemma 6.6, there exists a P-only execution γ
starting from C and a set Q ⊂ P of two processes that is bivalent from D = Cγ. Pick any process z ∈ P−Q.
Then P − {z} is bivalent from D because Q is bivalent from D.
We construct a sequence of configurations D0, D1, . . . reachable from D, such that, for all i ≥ 0, the
following properties hold:
1. there exists a set of two processes Qi ⊆ P − {z} such that Qi is bivalent from Di,
2. there exists a 2ℓ-packing πi of the remaining processes Ri = (P − {z})−Qi in Di, and
3. if Li is the set of all locations that are fully 2ℓ-packed by Ri in Di, then Di+1 is reachable from Di by
a (P − {z})-only execution αi which contains a block multi-assignment βi such that, for each location
r ∈ Li, there are at least ℓ multiple assignments in βi that perform ℓ-buffer-write on r.
By the induction hypothesis applied to D and P − {z}, there is a (P − {z})-only execution η starting
from D and a set Q0 ⊆ (P − {z}) of two processes such that Q0 is bivalent from D0 = Dη and there
exists a 2ℓ-packing π0 of the remaining processes R0 = (P − {z})−Q0 in D0.
Now suppose that Di is a configuration reachable from D and Qi and Ri are sets of processes that satisfy
all three conditions.
By Lemma 7.3 applied to configuration Di, there is a Qi-only execution ξi such that Ri ∪Qi = P − {z}
is bivalent from Diξiβi, where βi is a block multi-assignment in which ℓ-buffer-write is performed at least
ℓ times on r, for each location r ∈ Li. Applying the induction hypothesis to Diξiβi and P − {z}, we get
a (P − {z})-only execution ψi leading to a configuration Di+1 = Diξiβiψi, in which there is a set, Qi+1,
of two processes such that Qi+1 is bivalent from Di+1. Additionally, there exists a 2ℓ-packing πi+1 of the
remaining processes Ri+1 = (P − {z})− Qi+1 in Di+1. Note that the execution αi = ξiβiψi contains the
block multi-assignment βi.
Since there are only finitely many locations, there exist indices 0 ≤ i < j such that Li = Lj, i.e., the set of
fully 2ℓ-packed locations by Ri in Di is the same as the set of fully 2ℓ-packed locations by Rj in Dj . Next,
we insert steps of z that cannot be detected by any process in P − {z}. Consider any {z}-only execution
ζ from Diξi that decides a value v ∈ {0, 1}. If ζ does not contain any ℓ-buffer-write to locations outside
Li, then Diξiζβi is indistinguishable from Diξiβi to processes in P − {z}. Since Diξiβi is bivalent for
P − {z}, there exists a (P − {z})-only execution from Diξiβi and, hence, from Diξiζβi that decides v̄, contradicting
agreement. Thus ζ contains an ℓ-buffer-write to a location not in Li. Let ζ′ be the longest prefix of ζ that
does not contain an ℓ-buffer-write to a location outside Li. Then, in Diξiζ′, z is poised to perform a multiple
assignment containing an ℓ-buffer-write to a location outside Li = Lj .
Diξiζ′βi is indistinguishable from Diξiβi to P − {z}, so the (P − {z})-only execution ψiαi+1 · · ·αj−1 can
be applied from Diξiζ′βi. Let α = γηα0 · · ·αi−1ξiζ′βiψiαi+1 · · ·αj−1. Every process in P − {z} is in the
same state in Cα as it is in Dj . In particular, Qj ⊆ P − {z} is bivalent from Dj and, hence, from Cα. The
2ℓ-packing πj of Rj in Dj is a 2ℓ-packing of Rj in Cα and Li = Lj is the set of locations that are fully
2ℓ-packed by Rj in Cα. Since z takes no steps after Diξiζ′, z covers a location r not in Lj in configurations
Dj and Cα. Since r ∉ Lj, there is a 2ℓ-packing π′j of Rj in Cα which packs fewer than 2ℓ processes into r.
Let π be the packing that packs z into location r and packs each process in Rj in the same location as π′j
does. Then π is a 2ℓ-packing of Rj ∪ {z} = P − Qj in Cα.
We can now prove the main theorem.
Theorem 7.5. Consider a memory consisting of ℓ-buffers, in which each process can atomically perform
ℓ-buffer-write to any subset of the ℓ-buffers. Then any obstruction-free binary consensus algorithm for n
processes uses at least ⌈(n− 1)/2ℓ⌉ locations.
Proof. Consider any obstruction-free binary consensus algorithm for n processes. By Lemma 6.4, there exists
an initial configuration from which the set of all n processes, P , is bivalent. Lemma 7.4 implies that there
is a configuration, C, reachable from this initial configuration, a set of two processes Q ⊆ P such that Q
is bivalent from C, and a 2ℓ-packing π of the remaining processes R = P − Q in C. By the pigeonhole
principle, R covers at least ⌈(n− 2)/2ℓ⌉ different locations.
Suppose that R covers exactly ⌈(n − 2)/2ℓ⌉ different locations and ⌈(n − 2)/2ℓ⌉ < ⌈(n − 1)/2ℓ⌉. Then
n− 2 is a multiple of 2ℓ and every location is fully 2ℓ-packed by R. Since Q is bivalent from C, Lemma 7.3
implies that there is a Q-only execution ξ such that some process in Q covers a location that is not fully
2ℓ-packed by R. Hence, there are at least ⌈(n− 2)/2ℓ⌉+ 1 = ⌈(n− 1)/2ℓ⌉ locations.
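The last step of the proof relies on a small arithmetic fact: ⌈(n − 2)/2ℓ⌉ and ⌈(n − 1)/2ℓ⌉ differ only when n − 2 is a multiple of 2ℓ, and then they differ by exactly 1. The following Python sketch checks this by brute force over small parameters. The function names are illustrative and are not part of the paper.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for integers a >= 0 and b > 0."""
    return -(-a // b)


def space_lower_bound(n: int, ell: int) -> int:
    """The bound stated in Theorem 7.5: ceil((n - 1) / (2 * ell))."""
    return ceil_div(n - 1, 2 * ell)


def check_case_analysis(max_n: int = 200, max_ell: int = 8) -> bool:
    """Check the arithmetic used at the end of the proof for small n and ell."""
    for ell in range(1, max_ell + 1):
        for n in range(2, max_n + 1):
            covered = ceil_div(n - 2, 2 * ell)   # pigeonhole bound on R
            bound = space_lower_bound(n, ell)    # bound claimed by the theorem
            assert bound - covered in (0, 1)
            if covered < bound:
                # The only case in which Lemma 7.3 is needed.
                assert (n - 2) % (2 * ell) == 0
                assert covered + 1 == bound
    return True
```

For example, with n = 10 and ℓ = 2, the pigeonhole principle gives ⌈8/4⌉ = 2 locations, every one of them fully packed, and the extra location supplied by Lemma 7.3 yields ⌈9/4⌉ = 3.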
8 Swap and Read
In this section, we present an anonymous obstruction-free algorithm for solving n-consensus using n − 1
shared memory locations, X1, . . . , Xn−1, which support read and swap. The swap(v) instruction atomically
sets the memory location to have value v and returns the value that it previously contained.
Intuitively, values 0, 1, . . . , n− 1 are competing to complete laps. If v gets a substantial lead on all other
values, then the value v is decided. Each process has a local variable ℓv, for each v ∈ {0, 1, . . . , n − 1}, in
which it stores its view of v’s current lap. Initially, these are all 0. If the process has input v, then its first
step is to set ℓv = 1. The process also has n local variables a1, . . . , an−1 and s. In ai, it stores the last value
it read from Xi, for i ∈ {1, . . . , n− 1}, and, in s it stores the value returned by its last swap operation.
When a process performs swap, it includes its process identifier and a strictly increasing sequence number
as part of its argument. Then it is possible to implement a linearizable, obstruction-free scan of the n − 1
shared memory locations using the double collect algorithm [AAD+93]: A process repeatedly collects the
values in all the locations (using read) until it observes two consecutive collects with the same values. In
addition to a process identifier and a sequence number (which we will henceforth ignore), each shared memory
location stores a vector of n components, all of which are initially 0. Likewise, a1, . . . , an−1 and s are initially
n-component vectors of 0’s.
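As an illustration of the double collect idea, here is a minimal Python sketch under a sequential model of the shared memory. The names SwapLocation, tagged, collect and scan are ours. The (identifier, sequence number) tag makes every swapped value distinct, so two identical consecutive collects imply that no swap took effect between them.

```python
from typing import Any, List, Tuple


class SwapLocation:
    """One shared memory location supporting read and swap (sequential model)."""

    def __init__(self, initial: Any) -> None:
        self._value = initial

    def read(self) -> Any:
        return self._value

    def swap(self, new_value: Any) -> Any:
        """Store new_value and return the value previously stored."""
        old, self._value = self._value, new_value
        return old


def tagged(pid: int, seq: int, vector: Tuple[int, ...]) -> Tuple[int, int, Tuple[int, ...]]:
    """Argument of a swap: identifier, sequence number, and the lap vector."""
    return (pid, seq, vector)


def collect(locations: List[SwapLocation]) -> List[Any]:
    """Read the locations one at a time, in order."""
    return [loc.read() for loc in locations]


def scan(locations: List[SwapLocation]) -> List[Any]:
    """Repeat collects until two consecutive collects agree, then return that view."""
    previous = collect(locations)
    while True:
        current = collect(locations)
        if current == previous:
            return current
        previous = current
```

In a solo execution, scan returns after exactly two collects, which is why the construction is obstruction-free.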
A process begins by performing a scan of all n−1 memory locations. Then, for each value v, it updates its
view, ℓv, of v’s current lap to be the maximum among ℓv, the v’th component of s, and the v’th component
of the vector in each memory location when its last scan was performed. If there is a memory location
that does not contain (ℓ0, ℓ1, . . . , ℓn−1), then the process performs swap((ℓ0, ℓ1, . . . , ℓn−1)) on the first such
location. Now suppose all the memory locations contain (ℓ0, ℓ1, . . . , ℓn−1). If there is a value v such that
ℓv is at least 2 bigger than every other component in this vector, then the process decides v. Otherwise, it
picks the value v with the largest current lap (breaking ties in favour of smaller values) and considers v to
have completed lap ℓv. Then it performs swap((ℓ0, . . . , ℓv−1, ℓv + 1, ℓv+1, . . . , ℓn−1)) on X1. If the process
doesn’t decide, it repeats this sequence of steps.
Algorithm 1 An n-consensus algorithm for a process with input value x
1: ℓx ← 1
2: loop
3:     (a1, . . . , an−1) ← scan(X1, . . . , Xn−1)
4:     for v ∈ {0, 1, . . . , n − 1} do
5:         ℓv ← max({ℓv, s[v]} ∪ {aj[v] : 1 ≤ j ≤ n − 1})
6:     ℓ∗ ← max{ℓ0, . . . , ℓn−1}
7:     v∗ ← min{v : ℓv = ℓ∗}
8:     if aj = (ℓ0, . . . , ℓn−1) for all 1 ≤ j ≤ n − 1 then    ⊲ v∗ has completed lap ℓ∗
9:         if ℓ∗ ≥ ℓv + 2 for all v ≠ v∗ then    ⊲ v∗ is at least 2 laps ahead of all other values
10:            decide v∗ and terminate
11:        ℓv∗ ← ℓv∗ + 1    ⊲ value v∗ is now on the next lap
12:    j ← min{j : aj ≠ (ℓ0, . . . , ℓn−1)}
13:    s ← swap(Xj, (ℓ0, . . . , ℓn−1))
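To make the control flow of Algorithm 1 concrete, the following Python sketch simulates it. It is a sketch under simplifying assumptions, not a definitive implementation: each scan is modelled as a single atomic step (justified by the linearizable scan described above), so process identifiers and sequence numbers are omitted from the stored values. The names process and run, the random scheduler, and the parameter interleaved_steps are ours. Each process is a generator that yields after every shared-memory step, so the harness can interleave processes arbitrarily and then let each remaining process run solo.

```python
import random
from typing import Generator, List, Optional, Tuple

Vector = Tuple[int, ...]


def process(x: int, n: int, X: List[Vector], decided: List[Optional[int]],
            pid: int) -> Generator[None, None, None]:
    """Algorithm 1 for input x; each yield follows one shared-memory step."""
    laps = [0] * n
    laps[x] = 1                                  # line 1
    s: Vector = (0,) * n                         # result of the last swap
    while True:                                  # line 2
        a = list(X)                              # line 3 (scan, assumed atomic)
        yield
        for v in range(n):                       # lines 4-5
            laps[v] = max([laps[v], s[v]] + [aj[v] for aj in a])
        top = max(laps)                          # line 6
        v_star = laps.index(top)                 # line 7 (smallest such v)
        if all(aj == tuple(laps) for aj in a):   # line 8
            if all(top >= laps[v] + 2 for v in range(n) if v != v_star):
                decided[pid] = v_star            # lines 9-10
                return
            laps[v_star] += 1                    # line 11
        j = next(i for i, aj in enumerate(a) if aj != tuple(laps))  # line 12
        s, X[j] = X[j], tuple(laps)              # line 13 (swap)
        yield


def run(inputs: List[int], seed: int = 0,
        interleaved_steps: int = 200) -> List[Optional[int]]:
    """Random interleaving, followed by a solo run of each undecided process."""
    n = len(inputs)
    X: List[Vector] = [(0,) * n for _ in range(n - 1)]
    decided: List[Optional[int]] = [None] * n
    procs = [process(x, n, X, decided, pid) for pid, x in enumerate(inputs)]
    live = set(range(n))
    rng = random.Random(seed)

    def step(pid: int) -> None:
        try:
            next(procs[pid])
        except StopIteration:
            live.discard(pid)

    for _ in range(interleaved_steps):
        if not live:
            break
        step(rng.choice(sorted(live)))
    for pid in sorted(live):                     # obstruction-freedom
        while pid in live:
            step(pid)
    return decided
```

The solo phase at the end of run is what obstruction-freedom guarantees will terminate; the random phase alone may or may not lead to a decision.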
We now prove that our protocol is correct, i.e. it satisfies validity, agreement, and obstruction-freedom.
Each step in the execution is either a swap performed on some Xj or a scan of X1, . . . , Xn−1. For each
scan, S, by a process p and for each v ∈ {0, . . . , n− 1}, we define ℓv(S) to be the value of p’s local variable
ℓv computed on line 5 following S. Similarly, for each swap U and each v ∈ {0, . . . , n − 1}, if U swaps the
contents of Xj with value (ℓ0, . . . , ℓn−1), then we define ℓv(U) = ℓv.
We begin with an easy observation, which follows from inspection of the code.
Observation 8.1. Let U be a swap by some process p and let S be the last scan that p performed before U .
Then, for each v ∈ {0, . . . , n− 1}, ℓv(U) ≥ ℓv(S). If there exists v ∈ {0, . . . , n− 1} such that ℓv(U) > ℓv(S),
then ℓv(U) = ℓv(S) + 1, ℓv′(S) ≤ ℓv(S) for all v′ ≠ v, and S returned the same value, (ℓ0(S), . . . , ℓn−1(S)),
from each shared memory location.
The next lemma follows from Observation 8.1. It says that if there was a scan, S, where value v is on
lap ℓ > 0, i.e. ℓv(S) = ℓ, then there was a scan where v is on lap ℓ − 1 and all the swap objects contained
this information.
Lemma 8.2. Let S be any scan and let v ∈ {0, . . . , n − 1}. If ℓv(S) > 0, then there was a scan, T ,
prior to S such that T returned the same value from each shared memory location, ℓv(T ) = ℓv(S) − 1, and
ℓv′(T ) ≤ ℓv(T ), for all v′ ≠ v.
Proof. Since each swap object initially contains an n-component vector of 0’s and ℓv(S) > 0, there was a swap
U prior to S such that ℓv(U) = ℓv(S). Consider the first such swap. Let p be the process that performed U
and let T be the last scan performed by p before U. By the first part of Observation 8.1, ℓv(U) ≥ ℓv(T) and,
by definition of U, ℓv(U) > ℓv(T) (otherwise, there would have been an earlier swap U′ with ℓv(U′) = ℓv(T) =
ℓv(U) = ℓv(S)). By the second part of Observation 8.1, it follows that ℓv(U) = ℓv(T) + 1, ℓv′(T) ≤ ℓv(T) for
all v′ ≠ v, and T returned the same value, (ℓ0(T), . . . , ℓn−1(T)), from each shared memory location. Since
ℓv(U) = ℓv(S), it follows that ℓv(T) = ℓv(U) − 1 = ℓv(S) − 1.
The following lemma is key. In particular, it says that if a process considers value v to have completed
lap ℓ as a result of performing a scan S where all the components have the same value, then every process
will think that v is at least on lap ℓ when it performs any scan after S.
Lemma 8.3. Suppose S is a scan that returns the same value from each shared memory location. If T is a
scan performed after S, then, for each v ∈ {0, . . . , n− 1}, ℓv(T ) ≥ ℓv(S).
Proof. Suppose, for a contradiction, that there is a scan T after S such that ℓv(T ) < ℓv(S) for some
v ∈ {0, . . . , n − 1}. Consider the first such scan T. Since the value of ℓv computed on line 5 is the
maximum of a set that includes ℓv, the value of each local variable ℓv is non-decreasing. Since ℓv(T ) < ℓv(S),
the process, q, that performed T is not the process that performed S.
The value returned by T from each shared memory location is either the value returned by S from that
location or is the argument of a swap performed on that location between S and T . By assumption, the
value returned by S from each shared memory location is (ℓ0(S), . . . , ℓn−1(S)). Since ℓv(T ) < ℓv(S) and the
value of ℓv computed by q on line 5 after T is at least as large as the v’th components of the values returned
by T from each shared memory location, it follows that the value returned by T from each shared memory
location is the argument of a swap performed on that location between S and T .
Partition the swaps that occur between S and T into two sets, W and W ′. For any swap Y performed
by a process p between S and T , Y ∈ W if p’s last scan prior to Y occurred before S. Otherwise Y ∈ W ′.
In particular, if process p performed S, then all of the swaps between S and T performed by p are in W ′.
Each process alternately performs scan and swap. Therefore, if a process performs more than one swap
between S and T , then all of them, except possibly the first, are in W ′. It follows that each swap in W is
by a different process and |W | ≤ n− 1.
If the swaps in W modify fewer than n − 1 different swap objects, then the value returned by T from
some shared memory location is the argument of a swap U′ ∈ W′ such that ℓv(U′) < ℓv(S).
Otherwise, the swaps in W modify exactly n − 1 shared memory locations. Then the process, q, that
performed T performed a swap U ∈W . Each swap in W modifies a different location. Therefore, the value
that U returns is either the value returned by S from that memory location or is the argument of a swap
U ′ ∈W ′ (performed at the same location). Since ℓv(T ) < ℓv(S) and the value of ℓv computed by q on line 5
after T is at least as large as the v’th component of the result of every swap that q performed prior to T , it
follows that U returns the argument of a swap U ′ ∈W ′ such that ℓv(U ′) < ℓv(S).
In either case, let p′ be the process that performed U ′ and let T ′ be the last scan that p′ performed prior to
U′. By definition of W′, T′ occurs between S and U′ and, hence, before T. By definition of T, ℓv(T′) ≥ ℓv(S),
for each v ∈ {0, 1, . . . , n − 1}. Therefore, by Observation 8.1, ℓv(U′) ≥ ℓv(T′), so ℓv(U′) ≥ ℓv(S). This is a
contradiction.
The previous lemma allows us to prove that once a value v is at a lap ℓ that is at least 2 laps ahead of
another value v′ and every swap object contains this information, then v′ will never reach lap ℓ, i.e. v will
always be at least one lap ahead of v′.
Lemma 8.4. Suppose S is a scan that returns the same value (ℓ0(S), . . . , ℓn−1(S)) from each shared memory
location and there is some v ∈ {0, . . . , n − 1} such that ℓv(S) ≥ ℓv′(S) + 2 for all v′ ≠ v. Then, for every
scan T and every value v′ ≠ v, ℓv′(T ) ≤ ℓv′(S) + 1.
Proof. Suppose, for a contradiction, that there is some scan T and some value v′ ≠ v such that ℓv′(T) ≥
ℓv′(S) + 2 > 0. Consider the first such scan. By Lemma 8.2, there was a scan, T′, prior to T such that T′
returned the same value from each shared memory location, ℓv′(T′) = ℓv′(T) − 1, and ℓv(T′) ≤ ℓv′(T′). By
definition of T, ℓv′(T′) < ℓv′(S) + 2. Hence, ℓv′(T) = ℓv′(S) + 2 and ℓv′(T′) = ℓv′(S) + 1.
Since ℓv′(S) < ℓv′(T′) and T′ returned the same value from each shared memory location, the contrapositive
of Lemma 8.3 implies that S was performed before T′. Since S returned the same value from each
shared memory location, Lemma 8.3 implies that ℓv(T′) ≥ ℓv(S). By assumption, ℓv(S) ≥ ℓv′(S) + 2. Hence,
ℓv(T′) ≥ ℓv′(S) + 2 = ℓv′(T′) + 1. This contradicts the fact that ℓv(T′) ≤ ℓv′(T′).
We can now prove that the protocol satisfies agreement, validity, and obstruction-free termination.
Lemma 8.5. No two processes decide differently.
Proof. From lines 8–10 of the code, the last step a process performs before deciding value v∗ is a scan, S,
that returns the same value from each shared memory location and such that ℓv∗(S) ≥ ℓv(S) + 2 for all
v ≠ v∗. Consider the first such scan. By Lemma 8.3, ℓv∗(T) ≥ ℓv∗(S) for every scan T performed after S.
By Lemma 8.4, ℓv(T) ≤ ℓv(S) + 1 for all v ≠ v∗. Hence, ℓv∗(T) > ℓv(T). It follows that no process ever
decides v ≠ v∗.
Lemma 8.6. If every process has input x, then no process decides x′ ≠ x.
Proof. Suppose there is a swap, U, such that ℓx′(U) > 0 for some x′ ≠ x. Consider the first such swap.
Let p be the process that performed U and let S be the last scan that p performed before U. Since each
shared memory location initially stores an n-component vector of 0’s, ℓx′(S) = 0 < ℓx′(U). By the second
part of Observation 8.1, ℓx(S) ≤ ℓx′(S) = 0. Since p has input x, it set ℓx = 1 on line 1. From the code, ℓx
is non-decreasing, so ℓx ≥ 1 whenever p performs line 5. By definition, ℓx(S) ≥ 1. This is a contradiction.
Thus, ℓx′(U) = 0 for all swaps U. Since no process has input x′, no process set ℓx′ = 1 on line 1. It follows
that ℓx′(S) = 0 for all scans S, so no process decides x′ ≠ x.
Lemma 8.7. Every process decides after performing at most 3n− 2 scans in a solo execution.
Proof. Let p be any process and consider the first scan S performed by p in its solo execution. After
performing at most n − 1 swaps, all with value (ℓ0(S), . . . , ℓn−1(S)), p will perform a scan that returns
(ℓ0(S), . . . , ℓn−1(S)) from each shared memory location. Let v∗ = min{v : ℓv(S) ≥ ℓv′(S) for all v′ ≠ v}.
If ℓv∗(S) ≥ ℓv(S) + 2 for all v ≠ v∗, then p decides v∗. Otherwise, p performs n − 1 swaps, all with value
(ℓ′0, . . . , ℓ′n−1), where ℓ′v∗ = ℓv∗(S) + 1 and ℓ′v = ℓv(S), for v ≠ v∗. Then it performs a scan that returns a
vector whose components all contain (ℓ′0, . . . , ℓ′n−1). If ℓ′v∗ ≥ ℓ′v + 2 for all v ≠ v∗, then p decides v∗. If not,
then p performs an additional n − 1 swaps, all with value (ℓ′′0, . . . , ℓ′′n−1), where ℓ′′v∗ = ℓ′v∗ + 1 = ℓv∗(S) + 2
and ℓ′′v = ℓ′v = ℓv(S) for v ≠ v∗. Finally, p performs a scan that returns a vector whose components all
contain (ℓ′′0, . . . , ℓ′′n−1) and decides v∗. Since p performs at most 3(n − 1) swaps and each swap is immediately
followed by a scan, this amounts to at most 3n − 2 scans, including the first scan, S.
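The count in Lemma 8.7 can be checked on small instances. The Python sketch below runs one process solo from a given memory state and counts its scans. The function name solo_run is ours, and scans are again assumed to be atomic. Starting from the all-zero initial memory, the process needs 2n − 1 scans. Starting from a memory in which another value has already been written on lap 1 to every location, it needs exactly 3n − 2.

```python
from typing import List, Tuple

Vector = Tuple[int, ...]


def solo_run(n: int, x: int, memory: List[Vector]) -> Tuple[int, int]:
    """Run Algorithm 1 solo with input x; return (decision, number of scans)."""
    laps = [0] * n
    laps[x] = 1
    s: Vector = (0,) * n
    scans = 0
    while True:
        a = list(memory)                         # scan (assumed atomic)
        scans += 1
        laps = [max([laps[v], s[v]] + [aj[v] for aj in a]) for v in range(n)]
        top = max(laps)
        v_star = laps.index(top)                 # smallest value on the top lap
        if all(aj == tuple(laps) for aj in a):
            if all(top >= laps[v] + 2 for v in range(n) if v != v_star):
                return v_star, scans
            laps[v_star] += 1
        j = next(i for i, aj in enumerate(a) if aj != tuple(laps))
        s, memory[j] = memory[j], tuple(laps)    # swap on the first stale location


# n = 3, input 0, both locations already hold (0, 1, 0): the worst case.
decision, scans = solo_run(3, 0, [(0, 1, 0), (0, 1, 0)])
```

In the worst-case example, the first n − 1 swaps only bring the memory up to date with (1, 1, 0), after which value 0 must complete two further laps before it is 2 ahead of value 1.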
The preceding lemmas immediately yield the following theorem.
Theorem 8.8. There is an anonymous, obstruction-free protocol for solving consensus among n processes
that uses only n− 1 memory locations supporting read and swap.
In [FHS98], there is a proof that Ω(√n) shared memory locations are necessary to solve obstruction-free
consensus when the system only supports swap and read instructions.
9 Test-and-Set and Read
Consider a system that supports only test-and-set() and read(). If there are only 2 processes, then it is
possible to solve wait-free consensus using a single memory location. However, we claim that any algorithm
for solving obstruction-free binary consensus among n ≥ 3 processes must use an unbounded number of
memory locations. The key is to prove the following analogue of Lemma 6.5.
Lemma 9.1. Let P be a set of at least 3 processes and let C be a configuration. If P is bivalent from C,
then, for every k ≥ 0, there exists a P-only execution αk from C such that P is bivalent from Cαk and at
least k locations have been set to 1 in Cαk.
Proof. By induction on k. The base case, k = 0, holds when α0 is the empty execution. Given αk, for
some k ≥ 0, we show how to construct αk+1. By Lemma 6.6, there is a P-only execution ξ from Cαk and
two processes, p0, p1 ∈ P such that pi decides i in its terminating solo execution, γi, from Cαkξ, for i = 0, 1.
Let Lk be the set of at least k memory locations that have been set to 1 in Cαkξ.
Let z ∈ P − {p0, p1}. Suppose that z decides v ∈ {0, 1} in its solo execution δ from Cαkξ. If z does
not perform test-and-set() on a location outside Lk during δ, then Cαkξδ is indistinguishable from Cαkξ to
{p0, p1}, so γv̄ can be applied starting from Cαkξδ, violating agreement. Thus z performs test-and-set() on
a location outside Lk during δ. Let β be the shortest prefix of δ in which z performs test-and-set() on a
location outside Lk and let r be the location outside Lk on which z performs test-and-set() during β.
If {p0, p1} is bivalent from Cαkξβ, then αk+1 = αkξβ satisfies the claim for k + 1, since β sets location
r ∉ Lk to 1. So, without loss of generality, suppose that {p0, p1} is 0-univalent from Cαkξβ. Let ψ be the
longest prefix of γ1 such that p0 decides 0 from Cαkξψβ. Note that ψ ≠ γ1, since 1 is decided in γ1. Let
ψ′ be the first step in γ1 following ψ. If ψ′ is a read or a test-and-set() on a location in Lk ∪ {r}, then
Cαkξψψ′β is indistinguishable from Cαkξψβ to p0. This is impossible, since p0 decides 0 from Cαkξψβ and
p0 decides 1 from Cαkξψψ′β. Thus, ψ′ is a test-and-set() on a location outside Lk ∪ {r}. Hence Cαkξψψ′β
is indistinguishable from Cαkξψβψ′ to all processes. In particular, p0 decides 1 from Cαkξψβψ′.
Since p0 decides 0 from Cαkξψβ and decides 1 from Cαkξψβψ′, it follows that {p0, p1} is bivalent from
Cαkξψβ. Furthermore, β sets location r ∉ Lk to 1. Thus αk+1 = αkξψβ satisfies the claim for k + 1.
By Lemma 6.4, there is an initial configuration from which the set of all processes in the system is
bivalent. Then it follows from Lemma 9.1 that any binary consensus algorithm for n ≥ 3 processes uses an
unbounded number of locations.
Theorem 9.2. For n ≥ 3, it is not possible to solve n-consensus using a bounded number of memory
locations supporting only read() and test-and-set().
There is an algorithm for obstruction-free binary consensus that uses an unbounded number of shared
memory locations that support only read() and write(1) [GR05]. All locations are initially 0. The idea is
to simulate a counter using an unbounded number of binary registers and then to run the racing counters
algorithm presented in Lemma 3.1. In this algorithm, there are two unbounded tracks on which processes
race, one for preference 0 and one for preference 1. Each track consists of an unbounded sequence of shared
memory locations. To indicate progress, a process performs write(1) to the location on its preferred track
from which it last read 0. Since the count on each track does not decrease, a process can perform a scan
using the double collect algorithm [AAD+93]. It is not necessary to read all the locations in a track to
determine the count it represents. It suffices to read from the location on the track from which it last read
0, continuing to read from the subsequent locations on the track until it reads another 0. A process changes
its preference if it sees that the number of 1’s on its preferred track is less than the number of 1’s on the
other track. Once a process sees that its preferred track is at least 2 ahead of the other track, it decides its
current preference.
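The track mechanism can be sketched in a few lines of Python. This is only an illustration of the counter simulation and of the two rules just stated, under a sequential model of memory; it is not the full algorithm of [GR05]. The names Track, TrackView and update_preference are ours. Since a process writes 1 only to the location from which it last read 0, the 1's on a track always form a prefix, so the index of the first 0 is the count.

```python
from typing import Dict, Tuple


class Track:
    """An unbounded sequence of bits, all initially 0, with read and write(1)."""

    def __init__(self) -> None:
        self._bits: Dict[int, int] = {}

    def read(self, i: int) -> int:
        return self._bits.get(i, 0)

    def write_one(self, i: int) -> None:
        self._bits[i] = 1


class TrackView:
    """One process's local view of a track."""

    def __init__(self, track: Track) -> None:
        self.track = track
        self.frontier = 0   # location from which this process last read 0

    def count(self) -> int:
        """Resume reading at the frontier and stop at the next 0."""
        while self.track.read(self.frontier) == 1:
            self.frontier += 1
        return self.frontier

    def advance(self) -> None:
        """Indicate progress: write(1) to the location last read as 0."""
        self.track.write_one(self.frontier)


def update_preference(pref: int, count0: int, count1: int) -> Tuple[int, bool]:
    """Return (new preference, whether to decide it) for binary consensus."""
    counts = (count0, count1)
    other = 1 - pref
    if counts[pref] < counts[other]:
        return other, False                      # switch to the leading track
    return pref, counts[pref] >= counts[other] + 2
```

Two processes with the same stale frontier may both write to the same location, in which case the count grows by only 1. This loses increments but never decreases the count, which is the property the double collect relies on.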
It is possible to generalize this algorithm to solve n-valued consensus by having n tracks, each consisting
of an unbounded sequence of shared memory locations. Since test-and-set() can simulate write(1) by ignoring
the value returned, we get the following result.
Theorem 9.3. It is possible to solve n-consensus using an unbounded number of memory locations supporting
only read() and either write(1) or test-and-set().
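The simulation of write(1) by test-and-set() amounts to discarding the returned value. A minimal Python sketch, with a hypothetical class name and a sequential model of a single one-bit location:

```python
class TestAndSetBit:
    """A one-bit location, initially 0, supporting read() and test-and-set()."""

    def __init__(self) -> None:
        self._bit = 0

    def read(self) -> int:
        return self._bit

    def test_and_set(self) -> int:
        """Set the bit to 1 and return its previous value."""
        old, self._bit = self._bit, 1
        return old


def write_one(location: TestAndSetBit) -> None:
    """Simulate write(1): perform test-and-set() and ignore the result."""
    location.test_and_set()
```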
Now, suppose we can also perform write(0) or reset() a memory location from 1 to 0. There is an existing
binary consensus algorithm that uses 2n locations, each storing a single bit [Bow11]. Then, it is possible
to solve n-consensus using O(n log n) locations by applying Lemma 5.2. There is a slight subtlety, since
the algorithm in the proof of Lemma 5.2 uses two designated locations for each round, to which values in
{0, . . . , n−1} can be written. In place of each designated location, it is possible to use a sequence of n binary
locations, all initialized to 0. Instead of performing write(x) on the designated location, a process performs
write(1) to the (x+ 1)’st binary location. To find one of the values that has been written to the designated
location, a process reads the sequence of binary locations until it sees a 1.
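The replacement of a designated location by n binary locations can be sketched as follows in Python. The class name and method names are ours, and the model is sequential. Note that the read returns some value that has been written, not necessarily the most recent one, which is all that is required here.

```python
from typing import List, Optional


class BinaryEncodedLocation:
    """Stands in for a location holding values in {0, ..., n-1}, using n bits."""

    def __init__(self, n: int) -> None:
        self._bits: List[int] = [0] * n          # all initialized to 0

    def write(self, x: int) -> None:
        """Instead of write(x): write(1) to the (x+1)'st binary location."""
        self._bits[x] = 1

    def read_some(self) -> Optional[int]:
        """Read the binary locations in order until a 1 is seen."""
        for x, bit in enumerate(self._bits):     # one read() per iteration
            if bit == 1:
                return x
        return None                              # nothing has been written yet
```

For instance, after write(3) followed by write(1), read_some returns 1, since the second binary location is the first one found to contain a 1.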
Theorem 9.4. It is possible to solve n-consensus using O(n log n) memory locations supporting only read(),
either write(1) or test-and-set(), and either write(0) or reset().
10 Conclusions and Future Work
In this paper, we defined a hierarchy based on the space complexity of solving obstruction-free consensus. We
used consensus because it is a well-studied, general problem that seems to capture a fundamental difficulty
of multiprocessor synchronization. Moreover, consensus is universal: any sequentially defined object can be
implemented in a wait-free way using only consensus objects and registers [Her91].
We did not address the issue of universality within our hierarchy. One history object can be used to
implement any sequentially defined object. Consequently, it may make sense to consider defining a hierarchy
on sets of instructions based on implementing a history object, a compare-and-swap object, or a repeated
consensus object shared by n processes. However, the number of locations required for solving n-consensus is
the same as the number of locations required for obstruction-free implementations of these long-lived objects
for many of the instruction sets that we considered.
A truly accurate complexity-based hierarchy would have to take step complexity into consideration.
Exploring this may be an important future direction. Also, it is standard to assume that memory locations
have unbounded size, in order to focus solely on the challenges of synchronization. For a hierarchy to be
truly practical, however, we might need to consider the size of the locations used by an algorithm.
There are several other interesting open problems. To the best of our knowledge, all existing space lower
bounds rely on a combination of covering and indistinguishability arguments. However, when the covering
processes apply swap(x), as opposed to write(x), they can observe differences between executions, so they
can no longer be reused and still maintain indistinguishability. This means that getting a larger space lower
bound for {swap(x), read()} would most likely require new techniques. An algorithm that uses fewer than
n − 2 shared memory locations would be even more surprising, as the processes would have to modify the
sequence of memory locations they access based on the values they receive from swaps, to circumvent the
argument from [Zhu16]. The authors are unaware of any such algorithm.
Getting an ω(√n) space lower bound for solving consensus in a system that supports test-and-set(), reset()
and read() is also interesting. Using test-and-set(), processes can observe differences between executions as
they can using swap(x). However, each location can only store a single bit. This restriction could potentially
help in proving a lower bound.
To prove the space lower bound of ⌈(n − 1)/ℓ⌉ for ℓ-buffers, we extended the technique of [Zhu16]. The
n − 1 lower bound of [Zhu16] has since been improved to n by [EGZ18]. Hence, we expect that the new
simulation-based technique used there can also be extended to prove a tight space lower bound of ⌈n/ℓ⌉.
We conjecture that, for sets of instructions, I, which contain only read(), write(x), and either increment()
or fetch-and-increment(), SP(I, n) ∈ Θ(log n). Similarly, we conjecture, for I = {read(), write(0), write(1)},
SP(I, n) ∈ Θ(n log n). Proving these conjectures is likely to require techniques that depend on the number
of input values, such as in the lower bound for m-valued adopt-commit objects in [AE14].
We would like to understand the properties of sets of instructions at certain levels in the hierarchy. For
instance, what properties enable a collection of instructions to solve n-consensus using a single location? Is
there an interesting characterization of the sets of instructions I for which SP(I, n) is constant? How do
subsets of a set of instructions relate to one another in terms of their locations in the hierarchy? Alternatively,
what combinations of sets of instructions decrease the amount of space needed to solve consensus? For
example, using only read(), write(x), and either increment() or decrement(), more than one memory location
is needed to solve binary consensus. But with both increment() and decrement(), a single location suffices.
Are there general properties governing these relationships?
11 Acknowledgments
Support is gratefully acknowledged from the Natural Sciences and Engineering Research Council of Canada,
the National Science Foundation under grants CCF-1217921, CCF-1301926, and IIS-1447786, the Department
of Energy under grant ER26116/DE-SC0008923, and Oracle and Intel corporations.
The authors would like to thank Michael Coulombe, Dan Alistarh, Yehuda Afek, Eli Gafni and Philipp
Woelfel for helpful conversations and feedback.
References
[AAC09] James Aspnes, Hagit Attiya, and Keren Censor. Max registers, counters, and monotone circuits.
In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, PODC ’09,
pages 36–45, 2009.
[AAD+93] Yehuda Afek, Hagit Attiya, Danny Dolev, Eli Gafni, Michael Merritt, and Nir Shavit. Atomic
snapshots of shared memory. Journal of the ACM, 40(4):873–890, 1993.
[AE14] James Aspnes and Faith Ellen. Tight bounds for adopt-commit objects. Theory of Computing
Systems, 55(3):451–474, 2014.
[AH90] James Aspnes and Maurice Herlihy. Fast randomized consensus using shared memory. Journal
of Algorithms, 11(3):441–461, 1990.
[AW04] Hagit Attiya and Jennifer Welch. Distributed computing: fundamentals, simulations, and
advanced topics, volume 19. John Wiley & Sons, 2004.
[Bow11] Jack R. Bowman. Obstruction-free snapshot, obstruction-free consensus, and fetch-and-add
modulo k. Technical Report TR2011-681, Computer Science Department, Dartmouth College,
2011. http://www.cs.dartmouth.edu/reports/TR2011-681.pdf.
[BRS15] Zohir Bouzid, Michel Raynal, and Pierre Sutra. Anonymous obstruction-free (n, k)-set agreement
with n-k+1 atomic read/write registers. Distributed Computing, pages 1–19, 2015.
[EGSZ16] Faith Ellen, Rati Gelashvili, Nir Shavit, and Leqi Zhu. A complexity-based hierarchy for
multiprocessor synchronization: [extended abstract]. In Proceedings of the 35th ACM Symposium on
Principles of Distributed Computing, PODC ’16, pages 289–298, 2016.
[EGZ18] Faith Ellen, Rati Gelashvili, and Leqi Zhu. Revisionist simulations: A new approach to proving
space lower bounds. In Proceedings of the 37th ACM symposium on Principles of Distributed
Computing, PODC ’18, 2018.
[FHS98] Faith Ellen Fich, Maurice Herlihy, and Nir Shavit. On the space complexity of randomized
synchronization. Journal of the ACM, 45(5):843–862, 1998.
[FLMS05] Faith Ellen Fich, Victor Luchangco, Mark Moir, and Nir Shavit. Obstruction-free algorithms
can be practically wait-free. In Proceedings of the 19th International Symposium on Distributed
Computing, DISC ’05, pages 78–92, 2005.
[Gel15] Rati Gelashvili. On the optimal space complexity of consensus for anonymous processes. In
Proceedings of the 29th International Symposium on Distributed Computing, DISC ’15, pages
452–466, 2015.
[GHHW13] George Giakkoupis, Maryam Helmi, Lisa Higham, and Philipp Woelfel. An O(√n) space bound
for obstruction-free leader election. In Proceedings of the 27th International Symposium on
Distributed Computing, DISC ’13, pages 46–60, 2013.
[GR05] Rachid Guerraoui and Eric Ruppert. What can be implemented anonymously? In Proceedings
of the 19th International Symposium on Distributed Computing, DISC ’05, pages 244–259, 2005.
[Her91] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and
Systems, 13(1):124–149, 1991.
[HR00] Maurice Herlihy and Eric Ruppert. On the existence of booster types. In Proceedings of the 41st
IEEE Symposium on Foundations of Computer Science, FOCS ’00, pages 653–663, 2000.
[HS12] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann,
2012.
[Int12] Intel. Transactional Synchronization in Haswell, 2012.
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell.
[Jay93] Prasad Jayanti. On the robustness of Herlihy’s hierarchy. In Proceedings of the 12th ACM
Symposium on Principles of Distributed Computing, PODC ’93, pages 145–157, 1993.
[LH00] Wai-Kau Lo and Vassos Hadzilacos. All of us are smarter than any of us: Nondeterministic
wait-free hierarchies are not robust. SIAM Journal on Computing, 30(3):689–728, 2000.
[MPR18] Achour Mostéfaoui, Matthieu Perrin, and Michel Raynal. A simple object that spans the whole
consensus hierarchy. arXiv preprint arXiv:1802.00678, 2018.
[Ray12] Michel Raynal. Concurrent programming: algorithms, principles, and foundations. Springer
Science & Business Media, 2012.
[Rup00] Eric Ruppert. Determining consensus numbers. SIAM Journal on Computing, 30(4):1156–1168,
2000.
[Sch97] Eric Schenk. The consensus hierarchy is not robust. In Proceedings of the 16th ACM Symposium
on Principles of Distributed Computing, PODC ’97, page 279, 1997.
[Tau06] Gadi Taubenfeld. Synchronization algorithms and concurrent programming. Pearson Education,
2006.
[Zhu15] Leqi Zhu. Brief announcement: Tight space bounds for memoryless anonymous consensus. In
Proceedings of the 29th International Symposium on Distributed Computing, DISC ’15, page 665,
2015.
[Zhu16] Leqi Zhu. A tight space bound for consensus. In Proceedings of the 48th ACM Symposium on
Theory of Computing, STOC ’16, pages 345–350, 2016.
