Space Complexity of Fault Tolerant Register Emulations by Chockler, Gregory & Spiegelman, Alexander
Space Complexity of Fault Tolerant Register Emulations
Gregory Chockler
CS Department
Royal Holloway, University of London, UK
gregory.chockler@rhul.ac.uk
Alexander Spiegelman∗
Viterbi EE Department
Technion, Haifa, Israel
sashas@tx.technion.ac.il
Abstract
Driven by the rising popularity of cloud storage, the costs associated with implementing reliable
storage services from a collection of fault-prone servers have recently become an actively studied
question. The well-known ABD result shows that an f -tolerant register can be emulated using a
collection of 2f + 1 fault-prone servers each storing a single read-modify-write object type, which
is known to be optimal. In this paper we generalize this bound: we investigate the inherent space
complexity of emulating reliable multi-writer registers as a function of the type of the base objects
exposed by the underlying servers, the number of writers to the emulated register, the number of
available servers, and the failure threshold.
We establish a sharp separation between registers, and both max-registers (the base object
types assumed by ABD) and CAS in terms of the resources (i.e., the number of base objects of
the respective types) required to support the emulation; we show that no such separation exists
between max-registers and CAS. Our main technical contribution is lower and upper bounds on
the resources required in case the underlying base objects are fault-prone read/write registers.
We show that the number of required registers is directly proportional to the number of writers
and inversely proportional to the number of servers.
∗Alexander Spiegelman is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.
arXiv:1705.07212v1 [cs.DC] 19 May 2017
1 Introduction
Reliable storage emulations seek to construct fault-tolerant shared objects, such as read/write regis-
ters, using a collection of base objects hosted on failure-prone servers. Such emulations are core
enablers for many modern storage services and applications, including cloud-based online data
stores [14, 32, 31, 25, 30] and Storage-as-a-Service offerings [33, 36, 16, 38].
Most existing storage emulation algorithms are constructed from storage services capable of
supporting custom-built read-modify-write (RMW) primitives [5, 18, 18, 22, 15, 21, 3, 34, 29], and
perhaps the most famous one is ABD [5]. This algorithm emulates an f -tolerant atomic wait-free
register, accessed by an unbounded number of processes (readers and writers), on top of 2f + 1
servers, each of which stores a single RMW object. Since f -tolerant register emulation is impossible
with fewer than 2f + 1 servers [8, 28], the ABD algorithm’s space complexity is optimal.
However, support for atomic RMW is not always available: the operations exposed by network-
attached disks are typically limited to basic read/write capabilities, and the interfaces exposed by
cloud storage services sometimes augment this with simple conditional update primitives similar to
Compare-and-Swap (CAS). A natural question that arises is therefore how the ABD results generalize
when only weaker primitives (e.g., read/write registers) are available. More specifically, we are
interested in whether reliable storage emulations are possible with weaker primitives, and if so, what
their space complexity is, and in particular, whether it depends on the number of writers and the
number of servers.
To answer these questions, we consider a collection of n fault-prone servers, each of which stores
base objects supporting the given primitive. The failure granularity is servers, meaning that a
server crash causes all base objects it stores to crash as well. We study three primitives: read/write
register, max-register [4], and CAS. For each primitive, we are interested in the number of base
objects required to emulate an f -tolerant register for k writers using n servers.
To strengthen our result, we prove the lower bound under weak liveness and safety guarantees,
namely, obstruction freedom and write sequential safety (WS-Safety). The latter is a weak
generalization of Lamport’s notion of safety [27] to multi-writer registers, which we define in Section 2.
Since atomicity usually requires readers to write, which may induce a dependency on the number of
readers, we consider regularity for our upper bound; we define in Section 2 write sequential regu-
larity (WS-Regularity), which is a weaker form of multi-writer regularity defined in [34]. The lower
bound proved in [8, 28] on the number of servers required for f -tolerant register emulations can be
easily generalized for WS-Safe obstruction-free emulations. Therefore, we assume that n ≥ 2f + 1
throughout the paper.
Table 1 summarizes our results. Interestingly, even though both registers and max-registers have
the lowest consensus number of 1 in Herlihy’s hierarchy [23], our results show they are clearly sep-
arated with respect to their power to support a reliable multi-writer register in a space-efficient
fashion. On the other hand, no such separation exists between CAS, which has an infinite consensus
number, and max-register. As an aside, we note that our classification has implications for the stan-
dard shared memory model (without base object failures); for example, it implies that a max-register
for k writers cannot be implemented from fewer than k read/write registers (proven in Theorem 2).
Results. Despite the fact that the original ABD emulation [5] assumes a general RMW base
object on every server, we observe that the code executed by each server in the multi-writer ABD
protocol [22, 34, 29] can be encapsulated into the write-max (for handling update messages) and
read-max (for handling read messages) primitives of max-register. Therefore, the upper bound of
2f + 1 applies to max-registers as well. In Appendix B we show how to emulate a max-register from
a single CAS in a wait-free manner. Thus, the upper bound for max-register also applies to CAS.
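To make the max-register interface concrete, the following is a minimal single-threaded sketch in Python. The class and function names (`MaxRegister`, `CasCell`, `write_max_via_cas`) are illustrative assumptions, and the CAS retry loop is only one plausible way to obtain write-max from CAS; it is not claimed to be the construction of Appendix B, and its progress guarantees under contention are not analyzed here. Values are (timestamp, payload) pairs with unique timestamps, as in multi-writer ABD.

```python
class MaxRegister:
    """Single-threaded model of a max-register over comparable values."""

    def __init__(self, initial):
        self._value = initial

    def write_max(self, v):
        # Keep the larger of the stored value and v; smaller writes are no-ops.
        if v > self._value:
            self._value = v

    def read_max(self):
        return self._value


class CasCell:
    """Single-threaded model of a compare-and-swap object."""

    def __init__(self, initial):
        self._value = initial

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Install `new` only if the cell still holds `expected`.
        if self._value == expected:
            self._value = new
            return True
        return False


def write_max_via_cas(cell, v):
    # Retry until v is installed or the cell already holds a value >= v.
    while True:
        cur = cell.read()
        if cur >= v or cell.cas(cur, v):
            return


# Timestamps are (counter, writer_id) pairs compared lexicographically.
initial = ((0, 0), None)
updates = [((2, 1), "b"), ((1, 2), "a"), ((2, 0), "c")]

reg = MaxRegister(initial)
cell = CasCell(initial)
for u in updates:
    reg.write_max(u)            # server-side handling of an update message
    write_max_via_cas(cell, u)  # same effect on top of a CAS cell
```

In this sketch the order in which update messages arrive does not matter: both objects end up holding the pair with the largest timestamp.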
Table 1: The number of base objects used by f -tolerant register emulations with k writers and n > 2f servers.

Base object           Lower bound (WS-Safe, obstruction-free)      Upper bound (WS-Regular, wait-free)
max-register          2f + 1                                       2f + 1
CAS                   2f + 1                                       2f + 1
read/write register   $kf + \lceil kf/(n-(f+1)) \rceil (f+1)$      $kf + \lceil k / \lfloor (n-(f+1))/f \rfloor \rceil (f+1)$

Our main technical contribution is a lower bound on the number of read/write registers required
to emulate an f -tolerant WS-Safe obstruction-free register for k clients using n servers. While the
ABD [5] space complexity does not depend on the number of writers or the number of servers, we
show in Section 3 (Theorem 1) that when servers support only read/write registers, the lower bound
increases linearly with the number of writers and decreases (up to a certain point) with the number of
available servers. In particular, our lower bound implies that at least kf + f + 1 registers are needed
regardless of the number of available servers. We exploit asynchrony to show that an emulated write
must complete even if it leaves f pending writes on base registers, forcing the next writer to use a
different set of registers, even in a write-sequential run.
In Section 3.3, we present a new upper bound construction that closely matches our lower bound
(Theorem 3). In particular, the two bounds coincide in the two important cases of n = 2f + 1 and
n ≥ kf + f + 1, where they are equal to kf + k(f + 1) and kf + f + 1, respectively. Closing the
remaining small gap is an interesting open question.
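As a quick numerical check of the read/write register bounds in Table 1, the following Python sketch evaluates both formulas. The function names are illustrative assumptions; the formulas are the lower bound of Theorem 1 and the upper bound stated in Table 1.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def lower_bound(k, f, n):
    """kf + ceil(kf / (n - (f+1))) * (f+1), for n >= 2f + 1."""
    assert k > 0 and f > 0 and n >= 2 * f + 1
    return k * f + ceil_div(k * f, n - (f + 1)) * (f + 1)


def upper_bound(k, f, n):
    """kf + ceil(k / floor((n - (f+1)) / f)) * (f+1), for n >= 2f + 1."""
    assert k > 0 and f > 0 and n >= 2 * f + 1
    z = (n - (f + 1)) // f  # at least 1 because n >= 2f + 1
    return k * f + ceil_div(k, z) * (f + 1)


# n = 2f + 1: both bounds equal kf + k(f+1).
print(lower_bound(3, 1, 3), upper_bound(3, 1, 3))
# n >= kf + f + 1: both bounds equal kf + f + 1.
print(lower_bound(3, 1, 5), upper_bound(3, 1, 5))
# An intermediate case where a gap remains (k = 3, f = 2, n = 6).
print(lower_bound(3, 2, 6), upper_bound(3, 2, 6))
```

For k = 3, f = 2, n = 6 the two formulas give 12 and 15, which illustrates the remaining gap between the bounds.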
Another open question is whether our lower bound is tight for stronger regularity definitions [35].
In the special case of n = 2f +1 servers and k writers, a matching upper bound of (2f +1)k registers
can be achieved by simply having each server implement a single k-writer max-register from k base
registers [4]. The question of the general case of n ≥ 2f + 1 remains open.
In Appendix C, we show the following three additional results implied by an extended variant of
our main lower bound construction: (1) a lower bound of k registers per server for n = 2f + 1 (The-
orem 6); (2) a lower bound on the number of servers when the maximum number of registers stored
on each server is bounded by a known constant (Theorem 7); and (3) impossibility of constructing
fault-tolerant multi-writer register emulations adaptive to point contention [1, 7] (Theorem 8).
Related work. The space complexity of fault-tolerant register emulations has been explored in a
number of prior works. Aguilera et al. [2] show that certain types of multi-writer registers cannot
be reliably emulated from a fixed number of fault-prone ones if the number of writers is not a priori
known. They do not, however, provide precise bounds on the number of base registers as a function
of the number of writers, the number of servers, and the failure threshold as we do in this paper. The
space complexity of reliable register emulations in terms of the amount of data stored on fault-prone
RMW servers was studied in [19, 13], and more recently in [37, 12, 11]. Since we are only interested
in the number of stored registers and not their sizes, these results are orthogonal to ours.
Basescu et al. [9] describe several fault-tolerant multi-writer register emulations from a collection
of fault-prone read/write data stores. Their algorithms incorporate a garbage-collection mechanism
that ensures that the storage cost is adaptive to the write concurrency, provided that the underlying
servers can be accessed in a synchronous fashion. Our results show that asynchrony has a profound
impact on storage consumption by exhibiting a sequential failure-free run where the number of
registers that need to be stored grows linearly with the number of writers.
The proof of our main result (see Lemma 1) further extends the adversarial framework of [37] to
exploit the notion of register covering (originally due to [10]) extended to fault-prone base registers
as in [2]. Covering arguments have been successfully applied to proving numerous space lower bound
results in the literature (see [6] for a survey) including the recent tight bounds for obstruction-free
consensus [20, 39], which are at the heart of the space hierarchy of [17].
2 Informal Model
In this section, we will introduce basic premises of our system model in informal terms. The formal
model can be found in Appendix A.
An asynchronous fault-prone shared memory system [26] consists of a set of shared base objects
$\mathcal{B} = \{b_1, b_2, \ldots\}$ accessed by clients in the set $\mathcal{C} = \{c_1, c_2, \ldots\}$. We extend the model by mapping
objects to a set $\mathcal{S} = \{s_1, s_2, \ldots\}$ of servers via a function $\delta : \mathcal{B} \to \mathcal{S}$. We denote $n \triangleq |\mathcal{S}|$. For
$B \subseteq \mathcal{B}$, $\delta(B)$ denotes the image of B, i.e., $\delta(B) = \{\delta(b) : b \in B\}$. Conversely, for $S \subseteq \mathcal{S}$,
$\delta^{-1}(S) = \{b : \delta(b) \in S\}$. Both servers and clients can fail by crashing. A crash of a server causes all
objects mapped to that server to instantaneously crash.¹
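The mapping δ and its image and preimage can be sketched in Python as follows. The concrete object and server names, and the helper names `image`, `preimage`, and `crashed_objects`, are hypothetical and serve only as an illustration.

```python
# delta maps each base object to the server that stores it (hypothetical layout).
delta = {"b1": "s1", "b2": "s1", "b3": "s2", "b4": "s3"}


def image(delta, objs):
    """delta(B): the set of servers hosting at least one object in objs."""
    return {delta[b] for b in objs}


def preimage(delta, servers):
    """delta^{-1}(S): the set of objects stored on the given servers."""
    return {b for b, s in delta.items() if s in servers}


def crashed_objects(delta, crashed_servers):
    # A server crash takes down every object mapped to that server.
    return preimage(delta, crashed_servers)


print(image(delta, {"b1", "b2"}))
print(crashed_objects(delta, {"s1"}))
```

When `delta` is injective, every server stores exactly one object, which corresponds to the original model without a separate notion of servers.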
We study algorithms emulating reliable multi-writer/multi-reader (MWMR) registers to a set of
clients. Our focus will be on a register supporting an a priori fixed and known set of k > 0 writers,
to which we will henceforth refer simply as k-register. Clients interact with the emulated register
via high-level read and write operations. To distinguish the high-level emulated reads and writes
from low-level base object invocations, we refer to the former as read and write. We say that
high-level operations are invoked and return whereas low-level operations are triggered and respond.
A high-level operation consists of a series of trigger and respond actions on base objects, starting
with the operation’s invocation and ending with its return. Since base objects are crash-prone, we
assume that the clients can trigger several low-level operations without waiting for the previously
triggered operations to respond.
An emulation algorithm A defines the behaviour of clients as deterministic state machines where
state transitions are associated with actions, such as trigger/response of low-level operations. A
configuration is a mapping to states from system components, i.e., clients and base objects. An
initial configuration is one where all components are in their initial states.
A run of A is a (finite or infinite) sequence of alternating configurations and actions, beginning
with some initial configuration, such that configuration transitions occur according to A. We will
henceforth refer to a transition of A occurring in a run (i.e., a triple consisting of a configuration, an
action, and the resulting configuration) simply as a step. We use the notion of time t during a run r
to refer to the configuration reached after t steps in r. A run is write-only if it has no invocations
of high-level read operations. A run r is write-sequential if no two high-level writes are concurrent
in r.
We say that a base object, client, or server is faulty in a run r if it fails at some time in r, and
correct otherwise. A run is fair if (1) for every low-level operation triggered by a correct client on a
correct base object there is eventually a matching response, and (2) every correct client gets infinitely
many opportunities to both trigger low-level operations and execute return actions. We say that a
low-level operation on a base object is pending in run r if it was triggered but has no matching
response in r. We assume that the base objects are atomic [24] (see Section A.3 of Appendix A).
Properties of emulation algorithms. Let A be a fault-tolerant k-register emulation for some
k > 0. We now give informal definitions of safety, liveness, fault-tolerance, and space consumption
properties of A. The formal definitions can be found in Appendix A.
Write-Sequential Regularity (WS-Regular): A is write-sequential regular (WS-Regular) if for all its
write-sequential runs r, for each high-level read operation Rd, there exists a linearization of the
sub-sequence of r consisting of Rd and all the high-level writes in r.²
Write-Sequential Safety (WS-Safe): Similar to WS-Regular, but only required to hold for high-level
reads that are not concurrent with any high-level writes.
¹ Note that the original fault-prone shared memory model [26] is the special case of our model in which δ is an injective function.
² Note that this definition generalizes Lamport’s notion of regularity to multi-writer registers; it coincides
with Lamport’s notion in the case of a single writer, but is weaker than the multi-writer regularity generalizations defined in [34].
Wait Freedom: A is wait-free if it guarantees that every high-level operation invoked by a correct
client eventually returns.
Obstruction Freedom: A is obstruction-free if every high-level operation invoked by a correct client
that is not concurrent with any other high-level operation by a correct client eventually returns.
f -tolerance: A is f -tolerant if it remains correct (in the sense of its safety and liveness properties)
as long as at most f servers crash for a fixed f > 0.
Resource Complexity: The resource consumption of A in a run r is the number of base objects used
by A in r. The resource complexity [26] of A is the maximum resource consumption of A in all its
runs.
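The write-sequential and WS-Safe conditions can be illustrated with a small Python checker. This is a sketch under simplifying assumptions that are not part of the formal model: every high-level operation is complete and is represented by a hypothetical record `Op(kind, start, end, value)` with distinct integer timestamps. Under these assumptions, a read that is not concurrent with any write in a write-sequential run must return the value of the last write that returned before the read was invoked (or the initial value if there is none).

```python
from collections import namedtuple

# kind is "read" or "write"; start/end are invocation and return times.
Op = namedtuple("Op", "kind start end value")


def concurrent(a, b):
    """Two operations are concurrent if neither returns before the other is invoked."""
    return not (a.end < b.start or b.end < a.start)


def is_write_sequential(ops):
    writes = [o for o in ops if o.kind == "write"]
    return not any(
        concurrent(a, b) for i, a in enumerate(writes) for b in writes[i + 1:]
    )


def ws_safe_read_ok(ops, rd, initial=None):
    """Check one read against WS-Safety in a write-sequential run of complete ops."""
    writes = [o for o in ops if o.kind == "write"]
    if any(concurrent(rd, w) for w in writes):
        return True  # WS-Safety places no constraint on this read
    earlier = [w for w in writes if w.end < rd.start]
    expected = max(earlier, key=lambda w: w.end).value if earlier else initial
    return rd.value == expected


w1 = Op("write", 0, 2, "v1")
w2 = Op("write", 3, 5, "v2")
good = Op("read", 6, 7, "v2")        # follows both writes, returns the last value
bad = Op("read", 6, 7, "v1")         # follows both writes, returns a stale value
overlapping = Op("read", 4, 6, "v1") # concurrent with w2, so unconstrained
```

WS-Regularity would additionally constrain `overlapping`; the checker above deliberately covers only the weaker WS-Safe condition.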
3 Resource Complexity of Write-Sequential k-register Emulation
In Section 3.1 we give an overview and intuition for our lower bound, and in Section 3.2 we prove it.
In Section 3.3 we present a closely matching upper bound algorithm.
3.1 Lower bound overview
We prove that any f -tolerant emulation of an obstruction-free WS-Safe k-register from a collection
of MWMR atomic registers stored on a collection S of crash-prone servers has resource complexity
of at least $kf + \lceil kf/(|S|-(f+1)) \rceil (f+1)$.
Our proof exploits the fact that the environment is allowed to prevent a pending low-level write
from taking effect for arbitrarily long [2]. As a result, a client executing a high-level write operation
cannot reliably store the requested value in a base register that has a pending write as this write
may take effect at a later time thus erasing the stored value. At the same time, the client cannot
wait for all base registers on which it triggers low-level operations to respond, since up to f of
them may reside on faulty servers. It therefore must be able to complete a high-level write without
waiting for responses from up to f registers. Consequently, the next high-level write (by a different
client) cannot reliably use these registers (as they might have outstanding low-level writes), and is
therefore forced to use additional registers, thus causing the total number of registers to grow with each
subsequent write.
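The effect of a delayed low-level write can be illustrated with a small Python model of a base register. The class `BaseRegister` and its method names are illustrative assumptions; a write takes effect only when the environment lets it respond, which matches the convention that a low-level write is linearized at its respond step.

```python
class BaseRegister:
    """Toy read/write register whose writes take effect when they respond."""

    def __init__(self, initial=None):
        self.value = initial
        self.pending = {}  # handle -> value of a triggered, not yet responded write
        self._next = 0

    def trigger_write(self, v):
        handle = self._next
        self._next += 1
        self.pending[handle] = v
        return handle

    def respond_write(self, handle):
        # The environment lets the write take effect and respond.
        self.value = self.pending.pop(handle)

    def read(self):
        return self.value


b = BaseRegister()
old = b.trigger_write("v1")  # triggered by an earlier writer, response delayed
new = b.trigger_write("v2")  # triggered by the next writer
b.respond_write(new)
after_new = b.read()         # the newer value is stored
b.respond_write(old)         # the delayed write finally takes effect
after_old = b.read()         # the newer value has been erased
```

Here `after_new` is "v2" but `after_old` is "v1": a register with an outstanding write cannot be relied upon by a later writer, which is why each new write is pushed toward registers that are not covered.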
In our main lemma (Lemma 1), we formalize this intuition as follows: Starting from a run r0
consisting of an initial configuration, we build a sequence of consecutive extensions r1, . . . , rk so that
ri is obtained from ri−1 by having a new client invoke a high-level write Wi of some (not previously
used) value. We then let the environment behave in an adversarial fashion (Definition 3) by blocking
the responses from the writes triggered on at most f base registers as well as the prior pending
low-level writes. In Lemma 3, we show that Wi must terminate without waiting for these responses
to arrive. Furthermore, in Lemma 4, we show that Wi must invoke low-level writes on at least 2f +1
base registers (residing on ≥ 2f + 1 different servers) that do not have any prior pending writes.
This, combined with Lemma 3, implies that by the time Wi completes, at least f more
registers on at least f servers have pending writes. Thus, by the time the kth
high-level write completes, the total number of covered registers is at least kf (see Lemma 1(a)).
To obtain a stronger bound, our construction is parameterized by an arbitrary subset F of servers
such that |F | = f + 1. We show that the extra storage available on these servers cannot, in fact, be
utilized by an emulation (see Lemma 1(b)) forcing it to use at least kf registers on the remaining
S \ F servers to accommodate the same number k of writers. We use this result in Theorem 1, to
show that the number of base registers required for the emulation is a function of k and |S|.
3.2 Lower Bounds
For any time t (following the tth action) in a run r of the emulation algorithm we define the following:
• Covering write: Let w be a low-level write triggered on a base register b at times ≤ t. We will
refer to w as covering at t, and to b as being covered by w at t if w is pending at t.
• C(t) ⊆ C: the set of all clients that have completed a high-level write operation at times ≤ t.
• Cov(t) ⊆ B: the set of all base registers being covered by some low-level write at time t.
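The sets defined above can be computed from a log of actions, as in the following Python sketch. The event encoding is a hypothetical simplification: `("trigger", write_id, register, client)` for a low-level write being triggered, `("respond", write_id)` for its response, and `("return", client)` for the return of a high-level write. Time t refers to the configuration reached after the first t actions.

```python
def covered(events, t):
    """Cov(t): registers with a low-level write that is pending after t actions."""
    pending = {}  # write_id -> register
    for ev in events[:t]:
        if ev[0] == "trigger":
            _, write_id, register, _client = ev
            pending[write_id] = register
        elif ev[0] == "respond":
            pending.pop(ev[1], None)
    return set(pending.values())


def completed_writers(events, t):
    """C(t): clients whose high-level write has returned within the first t actions."""
    return {ev[1] for ev in events[:t] if ev[0] == "return"}


events = [
    ("trigger", "w1", "b1", "c1"),
    ("trigger", "w2", "b2", "c1"),
    ("respond", "w1"),
    ("return", "c1"),              # c1 returns while w2 still covers b2
    ("trigger", "w3", "b1", "c2"),
]
print(covered(events, 2), covered(events, 3), covered(events, 5))
print(completed_writers(events, 5))
```

In this log, b2 remains covered after c1 returns, which is exactly the situation the lower bound argument exploits.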
We first prove the following key lemma:
Lemma 1. For all k > 0, f > 0, let A be an f -tolerant algorithm that emulates a WS-Safe
obstruction-free k-register using a collection S of servers storing a collection B of wait-free MWMR
atomic registers. Then, for every F ⊆ S such that |F | = f + 1, there exist k + 1 failure-free runs $r_i$,
0 ≤ i ≤ k, of A such that (1) $r_0$ is a run consisting of an initial configuration and $t_0 = 0$ steps, and
(2) for all i ∈ [k], $r_i$ is a write-only sequential extension of $r_{i-1}$ ending at time $t_i > 0$ that consists
of i complete high-level writes of i distinct values $v_1, \ldots, v_i$ by i distinct clients $c_1, \ldots, c_i$ such that:
a) $|Cov(t_i)| \ge if$
b) $\delta(Cov(t_i)) \cap F = \emptyset$
Fix arbitrary k > 0, f > 0, and a set F of servers such that |F | = f + 1. We proceed by induction
on i, 0 ≤ i ≤ k. Base: The lemma trivially holds for the run $r_0$ of A consisting of $t_0 = 0$ steps. Step: Assume
that $r_{i-1}$ exists for some i ∈ [k]. We show how $r_{i-1}$ can be extended up to time $t_i > t_{i-1}$ so that
the lemma holds for the resulting run $r_i$. For the remainder of the proof of Lemma 1, we will assume
without loss of generality that every low-level write operation is linearized simultaneously with its
respond step. Formally:
Assumption 1 (Write Linearization). For every extension r of ri−1 and a base object b ∈ B, let Lr|b
be a linearization of r|b. Then, Lr|b does not include any low-level write operations in pending(r|b),
and for any two low-level writes w1, w2 ∈ complete(r|b) such that respond(w1) ≺r|b respond(w2),
w1 ≺Lr|b w2.
Note that the above implies that no low-level write w that is covering some register b at a time
t in r will be observed by any low-level reads from b as having taken effect until after w’s respond
event occurs.
We proceed by introducing the following notation:
Definition 1. Let r be an extension of $r_{i-1}$. For all times $t \ge t_{i-1}$ in r, let
1. $Tr_i(t)$ be the set of all base registers on which a low-level write was triggered between $t_{i-1}$ and t.
2. $Rr_i(t) \subseteq Tr_i(t)$ be the set of all base registers on which a low-level write was triggered and
responded (took effect) between $t_{i-1}$ and t.
3. $Cov_i(t) = Cov(t) \setminus Cov(t_{i-1})$ be the set of all base registers that have been newly covered between
$t_{i-1}$ and t. Note that $Cov_i(t) \subseteq Tr_i(t)$.
4. $Q_i(t) \subseteq S$ be the set of all servers such that $Q_i(t) = \delta(Cov_i(t)) \setminus F$ if $|\delta(Cov_i(t)) \setminus F| \le f$, and
$Q_i(t) = Q_i(t-1)$, otherwise.
5. $F_i(t) \triangleq \{s \in F \mid \delta^{-1}(\{s\}) \cap Rr_i(t) \neq \emptyset\}$, i.e., $F_i(t)$ is the set of all servers in F having a
register that responded to a low-level write invoked after $t_{i-1}$.
6. $M_i(t) \triangleq \delta(Cov_i(t)) \cap (F \setminus F_i(t))$, i.e., $M_i(t)$ is the set of all servers in F with at least one
register covered by a low-level write invoked after $t_{i-1}$ and without registers that have responded
to the low-level writes invoked after $t_{i-1}$.
7. $G_i(t) \subseteq S$ be the set of all servers such that $G_i(t) = M_i(t)$ if $|Q_i(t)| < |F_i(t)|$, and $G_i(t) = \emptyset$,
otherwise.
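The server sets of Definition 1 can be computed mechanically from the register sets, as in the following Python sketch. The function name `step_sets` and the example layout are hypothetical; `cov_i` and `rr_i` stand for Cov_i(t) and Rr_i(t), and `q_prev` stands for Q_i(t−1).

```python
def step_sets(delta, F, f, cov_i, rr_i, q_prev):
    """Return (Q_i(t), F_i(t), M_i(t), G_i(t)) following items 4-7 of Definition 1."""
    covered_servers = {delta[b] for b in cov_i}    # delta(Cov_i(t))
    outside = covered_servers - F                  # delta(Cov_i(t)) \ F
    Q = outside if len(outside) <= f else set(q_prev)
    F_i = {delta[b] for b in rr_i} & F             # servers in F with a responded register
    M = covered_servers & (F - F_i)                # covered servers in F with no response
    G = M if len(Q) < len(F_i) else set()
    return Q, F_i, M, G


# Hypothetical layout: one register per server, f = 1, F = {s1, s2}.
delta = {"b1": "s1", "b2": "s2", "b3": "s3"}
F = {"s1", "s2"}

# b1 (on s1) has responded; b2 and b3 are newly covered.
first = step_sets(delta, F, 1, cov_i={"b2", "b3"}, rr_i={"b1"}, q_prev=set())
# b1 has responded; only b2 is newly covered.
second = step_sets(delta, F, 1, cov_i={"b2"}, rr_i={"b1"}, q_prev=set())
```

In the first call Q_i = {s3} and |Q_i| = |F_i|, so G_i is empty; in the second call Q_i is empty while |F_i| = 1, so G_i = M_i = {s2}.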
Below we will introduce the adversary Adi, which causes A to gradually increase the number of
base registers covered after ti−1 by delaying the respond actions of some of the previously triggered
low-level writes.
Definition 2 (Blocked Writes). Let r be an extension of ri−1. For all times t ≥ ti−1 in r, let
BlockedWritesi(t) be the set of all low-level covering writes w satisfying either one of the following
two conditions:
1. w was triggered by a client in C(ti−1), or
2. w was triggered on a base register in δ−1(Qi(t) ∪Gi(t)).
We say that a pending low-level write w is blocked in an extension r of ri−1 if there exists a time
t ≥ ti−1 such that for all t′ > t in r, w ∈ BlockedWritesi(t′). The following definition specifies the
set of the environment behaviours that are allowed by Adi in all extensions of ri−1:
Definition 3 (Adi). For an extension r of ri−1 we say that the environment behaves like Adi after
ri−1 in r if the following holds:
1. There are no failures after time ti−1 in r.
2. For all t ≥ ti−1 in r, if a low-level write w ∈ BlockedWritesi(t), then w does not respond at t.
3. If r is infinite then:
(a) Every pending low-level read or write that is not blocked in r eventually responds.
(b) Every trigger or return action that is ready to be executed at a client c in r is eventually
executed.
For a finite extension r of ri−1, we will write 〈r,Adi〉 to denote the set of all extensions of r in
which the environment behaves like Adi after ri−1; and we will write 〈r,Adi, t〉 to denote the subset of
〈r,Adi〉 consisting of all runs having exactly t steps. For X ∈ {Qi, Fi,Mi} and a run r ∈ 〈ri−1, Adi, t〉,
we say that X is stable after r if for all t′ ≥ t for all extensions r′ ∈ 〈r,Adi, t′〉, X(t′) = X(t).
The following lemma (proven in Section C of the appendix) asserts several technical facts implied
directly by Definitions 1 and 3.
Lemma 2. For all t ≥ ti−1 and r ∈ 〈ri−1, Adi, t〉, all of the following hold at time t in r:
1. Qi(t) ⊆ δ(Covi(t)) \ F
2. Qi(t) ⊆ Qi(t+ 1)
3. Fi(t) ⊆ Fi(t+ 1)
4. |Fi(t)| − |Qi(t)| ≤ 1
5. |Qi(t)| ≤ f
6. |Fi(t)| ≤ f + 1
7. Fi(t) = Fi(t+ 1) =⇒ Mi(t) ⊆Mi(t+ 1)
8. |Mi(t)| ≤ f + 1
9. |δ(Covi(t)) \ F | ≥ f =⇒ |Qi(t)| ≥ f
10. |δ(Covi(t)) \ F | < f =⇒ δ(Rri(t)) \ F = ∅
11. (Qi(t) ∪Mi(t)) ∩ δ(Rri(t)) = ∅
The following corollary follows immediately from the claims 2–3 and 5–8 of Lemma 2.
Corollary 1. There exists a run r ∈ 〈ri−1, Adi〉 such that Fi, Qi, and Mi are all stable after r.
We first show that ri−1 can be extended with a complete high-level write Wi by a new client
ci such that the environment behaves like Adi until Wi returns. Roughly, the reason for this is
that Adi guarantees that after ri−1, ci would only miss responses for the writes invoked on at
most f servers (see Claim 1) as well as those that might have been invoked in ri−1 by other clients
{c1, . . . , ci−1}, which ci is unaware of. As a result, the involved servers and clients would appear to
ci as faulty after ri−1, and therefore, to ensure obstruction freedom, it must complete Wi without
waiting for the outstanding writes to respond.
Lemma 3. Let $r \in \langle r_{i-1}, Ad_i \rangle$ be a run consisting of $r_{i-1}$ followed by a high-level write invocation
$W_i$ by client $c_i \notin C(t_{i-1})$. Then, there exists a run $rr \in \langle r, Ad_i \rangle$ in which $W_i$ returns.
By Corollary 1, there exists an extension r′ ∈ 〈r,Adi, t′〉 where t′ > ti−1 such that Qi, Fi, and Mi
are all stable after r′. If Wi returns in r′, we are done. Otherwise, we will first bound the number of
servers |Qi(t′) ∪Mi(t′)| controlled by Adi as follows:
Claim 1. Consider a time t > ti−1, and a run r ∈ 〈ri−1, Adi, t〉. If Mi is stable after r, then
|Qi(t) ∪Mi(t)| ≤ f .
Proof. By Lemma 2.5, $|Q_i(t)| \le f$. Thus, if $M_i(t) = \emptyset$, then $|Q_i(t) \cup M_i(t)| \le f$ as needed. Otherwise
($M_i(t) \neq \emptyset$), we show that $|F_i(t)| = |Q_i(t)| + 1$. Suppose to the contrary that $|F_i(t)| \neq |Q_i(t)| + 1$.
Since by Lemma 2.4, $|F_i(t)| \le |Q_i(t)| + 1$, the only possibility for $|F_i(t)| \neq |Q_i(t)| + 1$ is if $|F_i(t)| \le
|Q_i(t)|$. Thus, by Definition 1.7, $G_i(t) = \emptyset$, and hence, by Definition 2, no writes on the registers
in $\delta^{-1}(M_i(t))$ are blocked. However, since $M_i(t) \neq \emptyset$, at least one register in $\delta^{-1}(M_i(t))$ must
have an outstanding write. Therefore, by Definition 3.3(a), there exists a time $t'$ and an extension
$r' \in \langle r, Ad_i, t' \rangle$ such that one of the registers on some server $s \in M_i(t)$ responds at time $t'$. Thus,
$s \in F_i(t')$, and therefore, $s \notin M_i(t')$. Hence, $M_i$ is not stable after r, contradicting the
assumption.
Since $F_i(t) \subseteq F$, $|F \setminus F_i(t)| + |F_i(t)| = |F| = f + 1$. Hence, $|F \setminus F_i(t)| = f + 1 - |F_i(t)| = f - |Q_i(t)|$.
Since by Definition 1.6, $M_i(t) \subseteq (F \setminus F_i(t))$, $|M_i(t)| \le |F \setminus F_i(t)| = f - |Q_i(t)|$. Thus, we obtain
$|Q_i(t) \cup M_i(t)| \le |Q_i(t)| + |M_i(t)| \le f$ as needed.
We are now ready to complete the proof of Lemma 3:
Proof of Lemma 3. By Claim 1, $|Q_i(t') \cup M_i(t')| \le f$, and by Lemma 2.11, no base registers on servers
in $Q_i(t') \cup M_i(t')$ have ever responded to any low-level writes issued after $t_{i-1}$. Thus, there exists a
finite run $r''$, which is identical to $r'$ except that all servers in $Q_i(t') \cup M_i(t')$ crash immediately after r
and each client $c_1, \ldots, c_{i-1}$ fails before any of its covering writes on registers in $Cov(t_{i-1})$ responds.
By f -tolerance and obstruction freedom, there exists a fair extension $r''\sigma$ of $r''$ (i.e., $r''\sigma \notin \langle r'', Ad_i \rangle$)
such that $W_i$ returns in $r''\sigma$. Since $Q_i \cup M_i$ is stable after $r'$, the set of registers precluded by $Ad_i$ from
responding in $r'$ is identical to that in $r''$, and since, by Assumption 1, no write with a missing response is
linearized, $r'$ is indistinguishable from $r''$ to $c_i$. Thus, $r'\sigma \in \langle r, Ad_i \rangle$, and since $\sigma$ includes the return
event of $W_i$, $r'\sigma$ satisfies the lemma.
We next show that in order to guarantee safety in the face of the environment behaving like Adi, Wi
must trigger a low-level write on at least one non-covered register on each server in a set of 2f + 1
servers. An illustration of the runs constructed in the proof appears in Figure 2 in Appendix C.
Lemma 4. Consider a run $r \in \langle r_{i-1}, Ad_i, t_r \rangle$ where $t_r > t_{i-1}$, consisting of $r_{i-1}$ followed by a
complete high-level write invocation by client $c_i \notin C(t_{i-1})$ that returns at time $t_r$. Then,
$|\delta(Tr_i(t_r) \setminus Cov(t_{i-1}))| > 2f$.
Proof. Denote $X \triangleq \delta(Tr_i(t_r) \setminus Cov(t_{i-1}))$, and assume by contradiction that $|X| \le 2f$. Let $S_1 =
F_i(t_r)$, $S_2 = Q_i(t_r)$, $S_3 = X \cap (F \setminus F_i(t_r))$ and $S_4 = X \setminus (S_1 \cup S_2 \cup S_3)$. Note that $S_1, S_2, S_3, S_4$ are
pairwise disjoint, and $X = S_1 \cup S_2 \cup S_3 \cup S_4$.
We first show that $|S_1 \cup S_4| \le f$. By Lemma 2.6, $|S_1| \le f + 1$. However, if $|S_1| = f + 1$,
then by Lemma 2.4, $|S_1| - |S_2| = f + 1 - |S_2| \le 1$, and therefore, $|S_1| + |S_2| \ge 2f + 1$, violating
the assumption. Hence, $|S_1| \le f$. By Lemma 2.5, $|S_2| \le f$. If $|S_2| = f$, then by assumption,
$|S_1 \cup S_3 \cup S_4| = |S_1| + |S_3| + |S_4| \le f$, and therefore, $|S_1 \cup S_4| = |S_1| + |S_4| \le f$. And if $|S_2| < f$,
then by Definitions 1.4 and 3, $|S_4| = 0$. Hence, $|S_1 \cup S_4| = |S_1| + |S_4| \le f$.
The proof proceeds by applying the partitioning argument to the sets Si, i ∈ [4] (see Appendix C).
The following corollary follows immediately from Lemma 4, Definitions 1.4 and 3, and the choice of
|F | = f + 1:
Corollary 2. Consider a run $r \in \langle r_{i-1}, Ad_i, t_r \rangle$ where $t_r > t_{i-1}$, consisting of $r_{i-1}$ followed by a
complete high-level write invocation by client $c_i \notin C(t_{i-1})$ that returns at time $t_r$. Then, $|Q_i(t_r)| = f$.
We are now ready to complete the proof of the induction step of Lemma 1:
Proof of the induction step (Lemma 1). By Lemma 3, there exists a run $r \in \langle r_{i-1}, Ad_i, t_r \rangle$, $t_r > t_{i-1}$,
in which $r_{i-1}$ is followed by a complete high-level write invocation $W_i$ by client $c_i \neq c_{i-1}$ writing a
value $v_i \neq v_{i-1}$ and returning at time $t_r$. By Corollary 2, $|Q_i(t_r)| = f$, and therefore, by Lemma 2.4,
$|F_i(t_r)| = f + 1$. Since $F_i(t_r) \subseteq F$ and $|F| = f + 1$, we conclude that $F_i(t_r) = F$. Hence, by
Definition 1.6, $M_i(t_r) = \emptyset$, which by Definition 1.7, implies that $G_i = \emptyset$. Thus, by Definition 2, no
writes on the registers in $\delta^{-1}(F)$ triggered after $t_{i-1}$ are blocked.
Hence, by Definition 3.3(a), there exists an extension $r' \in \langle r, Ad_i, t' \rangle$, for some $t' \ge t_r$, such that
$\delta(Cov_i(t')) \cap F = \emptyset$. We now show that $r_i = r'$ and $t_i = t'$ satisfy the lemma. By the induction
hypothesis and the construction of extension $r'$, $r'$ is a write-only failure-free sequential extension of
$r_{i-1}$ ending at time $t'$ that consists of i complete high-level writes of values $v_1, \ldots, v_i$ by i distinct
clients $c_1, \ldots, c_i$. It remains to show that claims (a) and (b) hold for $t_i = t'$:
a) |Cov(t′)| ≥ if : By the induction hypothesis |Cov(ti−1)| ≥ (i − 1)f , and by Definition 3,
Cov(ti−1) ⊆ Cov(t′). Therefore, it remains to show that |Cov(t′) \ Cov(ti−1)| ≥ f . Since by
Corollary 2, |Qi(tr)| = f , and by Lemma 2.2, Qi(tr) ⊆ Qi(t′), we get |Qi(t′)| = f . By
Definition 1.4, |Covi(t′)| ≥ |Qi(t′)|, and by Definition 1.3, |Cov(t′) \ Cov(ti−1)| = |Covi(t′)|.
Therefore, we get |Cov(t′) \ Cov(ti−1)| ≥ f .
b) δ(Cov(t′)) ∩ F = ∅: By the induction hypothesis we get that δ(Cov(ti−1)) ∩ F = ∅, and
by construction of r′ we get that δ(Covi(t′)) ∩ F = ∅. By Definition 1.3, Cov(t′) = Covi(t′) ∪
Cov(ti−1). Therefore, δ(Cov(t′)) ∩ F = ∅.
Resource Complexity. We will now use Lemma 1 to characterize the minimum resource complex-
ity of the algorithms implementing an f -tolerant obstruction-free WS-Safe k-register as a function
of the number |S| of available servers. First, it is easy to see that if |S| ≤ 2f , then no such al-
gorithm can exist. This result is implied by an extended statement of Lemma 1 (see Theorem 5
in Appendix C), and can also be shown directly by a straightforward application of a partitioning
argument as discussed in [28, 8]. If |S| > 2f , then we have the following:
Theorem 1. For all k > 0, f > 0, let A be an f -tolerant algorithm emulating an obstruction-free
WS-Safe k-register using a collection S of servers such that |S| ≥ 2f + 1. Then, A uses at least
$kf + \lceil kf/(|S|-(f+1)) \rceil (f+1)$ base registers (i.e., $|\delta^{-1}(S)| \ge kf + \lceil kf/(|S|-(f+1)) \rceil (f+1)$).
Proof. Let G ⊆ S be the set consisting of all servers that store at least $\lceil kf/(|S|-(f+1)) \rceil$ base registers
(i.e., $\forall s \in G$, $|\delta^{-1}(\{s\})| \ge \lceil kf/(|S|-(f+1)) \rceil$ and $\forall s \in S \setminus G$, $|\delta^{-1}(\{s\})| < \lceil kf/(|S|-(f+1)) \rceil$). We first show
that |G| ≥ f + 1. Suppose toward a contradiction that |G| < f + 1, and pick a set F , such that
|F | = f + 1 and S ⊃ F ⊃ G. By Lemma 1.a)-b), there exists a run r of A consisting of t steps
such that |Cov(t)| ≥ kf and δ(Cov(t)) ∩ F = ∅. Thus, by the pigeonhole argument, and since
|S \ F | = |S| − (f + 1), there is at least one server in S \ F that stores at least $\lceil kf/(|S|-(f+1)) \rceil$ base registers. Therefore,
since F ⊃ G and the number of objects stored on a server is an integer, we get that there is at least
one server in S \ G that stores at least $\lceil kf/(|S|-(f+1)) \rceil$ base registers. A contradiction.
We get that $|\delta^{-1}(G)| \ge \lceil kf/(|S|-(f+1)) \rceil (f+1)$. Now, again by Lemma 1.a)-b), there exists a run
r′ of A consisting of t′ steps such that |Cov(t′)| ≥ kf and δ(Cov(t′)) ∩ G = ∅, meaning that
$|\delta^{-1}(S \setminus G)| \ge kf$. Therefore, we get that $|\delta^{-1}(S)| \ge kf + \lceil kf/(|S|-(f+1)) \rceil (f+1)$.
The following bound on the number of registers required to emulate a single (i.e., non-fault-
tolerant) max-register is a direct consequence of Theorem 1 (see Appendix C for the proof):
Theorem 2 (Resource Complexity of k-max-register). For all k > 0, any algorithm implementing
a wait-free k-writer max-register from a collection of wait-free MWMR atomic base registers uses at
least k base registers.
In Appendix C, we prove an extended statement of Lemma 1, and use it to show three additional
lower bounds as discussed in Section 1.
3.3 Upper Bound
In this section we present an f -tolerant construction emulating a wait-free WS-Regular k-register for
all combinations of values of the parameters k > 0, f > 0, and n where n > 2f . Our construction is
carefully crafted to deal with the adversarial behaviour (Definition 3) that was exploited in the proof
of Lemma 1 while minimizing the resource complexity. Similarly to multi-writer ABD [22, 34, 29],
our algorithm uses read and write quorums to read from and write to registers. However, since RMW
objects are replaced with read/write registers, and covering low-level writes belonging to old writes
can overwrite registers at any time, the quorums in our case must have a larger intersection.
Let z ≜ ⌊(n − (f + 1)) / f⌋ and y ≜ zf + f + 1. We construct a collection R of m = ⌊k/z⌋ disjoint sets
R0, . . . , Rm−1, each of which consists of y registers, and if k/z is not an integer, then we add to R
another disjoint set Rm of (k − ⌊k/z⌋z)f + f + 1 registers. Intuitively, z is the maximum number
of writers that can be supported by a single set of y registers as can be deduced from Lemma 1’s
argument. If z divides k, then exactly k/z such sets are needed to accommodate the total of k writers.
Otherwise, the remaining k mod z writers are moved to an overflow set Rm. Note that for all Ri ∈ R,
2f + 1 ≤ |Ri| ≤ n. Then, we distribute the registers in each set Ri on servers in S so that every register
in Ri is mapped to a different server (i.e., |δ(Ri)| = |Ri|). Figure 1 demonstrates a possible mapping
from registers to servers. All in all, we use ΣRi∈R |Ri| = ⌊k/z⌋y + (k − ⌊k/z⌋z)f + (f + 1)(⌈k/z⌉ − ⌊k/z⌋) =
· · · = kf + ⌈k/z⌉(f + 1) = kf + ⌈k / ⌊(n − (f + 1)) / f⌋⌉(f + 1) registers.
Figure 1: A possible mapping from R to S in case n = 6, k = 5, and f = 2.
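The sizes of the register sets in the construction above can be sketched as follows. This is a minimal illustration that assumes only the definitions of z, y, and m given in the text; the function names `register_set_sizes` and `total_registers` are hypothetical and do not appear in the paper.

```python
from typing import List


def register_set_sizes(k: int, f: int, n: int) -> List[int]:
    """Return the sizes |R_0|, |R_1|, ... of the disjoint register sets.

    k: number of writers, f: failure threshold, n: number of servers.
    Hypothetical helper; it follows the definitions of z, y and m in the text.
    """
    if k <= 0 or f <= 0 or n <= 2 * f:
        raise ValueError("the construction assumes k > 0, f > 0 and n > 2f")
    z = (n - (f + 1)) // f   # writers supported by one full set
    y = z * f + f + 1        # registers in one full set
    m = k // z               # number of full sets R_0, ..., R_{m-1}
    sizes = [y] * m
    leftover = k - m * z     # the remaining k mod z writers
    if leftover:
        # Overflow set R_m, present only if z does not divide k.
        sizes.append(leftover * f + f + 1)
    return sizes


def total_registers(k: int, f: int, n: int) -> int:
    """Total number of registers used by the layout."""
    return sum(register_set_sizes(k, f, n))
```

For the parameters of Figure 1 (n = 6, k = 5, f = 2) this gives z = 1 and five sets of five registers each, which agrees with the closed form kf + ⌈k/z⌉(f + 1) = 10 + 15 = 25.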
The resulting layout is then used to derive the read and write quorums as follows: for every set
Ri ∈ R, any subset of Ri of size |Ri| − f is a write quorum for all writers cj such that j = ⌊i/z⌋;
and any subset of registers consisting of all registers mapped to n − f servers is a read quorum
(i.e., the set of the read quorums is {B ⊆ B : ∃S ∈ S s.t. |S| = 2f + 1 ∧ δ−1(S) = B}). Observe
that by construction of R, for every set Ri ∈ R, (1) the number of clients mapped to write quorums
in Ri is ⌊(|Ri| − (f + 1)) / f⌋ = (|Ri| − (f + 1)) / f , and (2) any write quorum in Ri intersects with
any read quorum on at least |Ri| − f registers. Therefore, in a write-sequential run,
the latest written value is always guaranteed to be available to subsequent reads provided every
writer c executing a high-level write W leaves no more than f pending low-level writes upon W ’s
completion. To enforce the latter, c is precluded from triggering new low-level writes on registers on
which it still has writes pending from preceding high-level write invocations. In addition, since
the registers in every (read or write) quorum are mapped to exactly n − f servers, each quorum
access is guaranteed to terminate, and thus, the algorithm is wait-free. In Appendix D we give the
algorithm’s full details, present its pseudo-code, and prove the following result:
Theorem 3. For all k > 0, f > 0, and n > 2f , there exists an f -tolerant algorithm emulating
a wait-free WS-Regular k-register using a collection of n servers storing kf + ⌈k/z⌉(f + 1) wait-free
z-writer/multi-reader atomic base registers where z = ⌊(n − (f + 1)) / f⌋.
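The availability argument behind the construction (every register of a set Ri resides on a different server, so f server crashes remove at most f registers of Ri and a write quorum of size |Ri| − f remains reachable) can be checked exhaustively for small parameters. The sketch below is illustrative only: the round-robin placement in `place_registers` is one possible mapping chosen for this example, not the mapping of Figure 1, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def place_registers(sizes: List[int], n: int) -> Dict[Tuple[int, int], int]:
    """Map register (i, j), the j-th register of set R_i, to a server index.

    Assumed round-robin placement: registers of one set occupy consecutive
    servers modulo n, so they are pairwise distinct whenever |R_i| <= n.
    """
    placement = {}
    start = 0
    for i, size in enumerate(sizes):
        if size > n:
            raise ValueError("a set cannot have more registers than servers")
        for j in range(size):
            placement[(i, j)] = (start + j) % n
        start = (start + size) % n
    return placement


def write_quorum_survives(sizes: List[int], n: int, f: int) -> bool:
    """Check that after any f server crashes each R_i keeps |R_i| - f registers."""
    placement = place_registers(sizes, n)
    for crashed in combinations(range(n), f):
        crashed_set = set(crashed)
        for i, size in enumerate(sizes):
            alive = sum(
                1 for j in range(size) if placement[(i, j)] not in crashed_set
            )
            if alive < size - f:
                return False
    return True
```

For example, with the set sizes [5, 5, 5, 5, 5] obtained for n = 6, k = 5, f = 2, the check succeeds for every choice of two crashed servers.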
4 Discussion and Future Directions
We studied the space complexity of emulating an f -tolerant register from fault-prone base objects as
a function of the base object type, the number of writers k, the number of available servers n, and
the failure threshold f . For the three object types considered, we established a sharp separation (by
factor k) between registers and both max-registers and CAS in terms of the number of objects of the
respective types required to support the emulation; we showed that no such separation between max-
registers and CAS exists. Interestingly, these results shed light on the resource complexity bounds
in the standard shared memory model (i.e., without object failures) as evidenced by our proof of
a lower bound on the number of registers required for implementing a max-register for k writers.
Our main technical contribution comprises the lower bound of ⌈k / ((n − (f + 1)) / f)⌉(f + 1) + kf and the upper
bound of ⌈k / ⌊(n − (f + 1)) / f⌋⌉(f + 1) + kf on the resource complexity of emulating an f -tolerant k-writer
register from n fault-prone servers storing read/write registers. To strengthen our lower bound, it
was proved for emulations satisfying weak liveness and safety properties.
Future directions. First, for some choices of k and n, our bounds leave a small gap that can be
closed. Second, an interesting question that arises is whether our lower bound is tight for stronger
properties. In the special case of n = 2f + 1 servers, emulation with stronger regularity [35] is
possible with (2f + 1)k registers (tight to our lower bound). However, the question of the general
case (n ≥ 2f + 1) remains open. In addition, since atomicity usually requires readers to write, it is
interesting to investigate whether the space complexity (assuming read/write registers) in this case
also linearly depends on the number of readers.
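The gap between the two bounds can be explored numerically. The following sketch simply evaluates the lower-bound and upper-bound expressions stated above; the function names are hypothetical and introduced only for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def lower_bound(k: int, f: int, n: int) -> int:
    """Lower bound: kf + ceil(kf / (n - (f + 1))) * (f + 1)."""
    return k * f + ceil_div(k * f, n - (f + 1)) * (f + 1)


def upper_bound(k: int, f: int, n: int) -> int:
    """Upper bound: kf + ceil(k / z) * (f + 1) with z = floor((n - (f + 1)) / f)."""
    z = (n - (f + 1)) // f
    return k * f + ceil_div(k, z) * (f + 1)


def gap(k: int, f: int, n: int) -> int:
    """Difference between the upper and the lower bound."""
    if k <= 0 or f <= 0 or n <= 2 * f:
        raise ValueError("both bounds assume k > 0, f > 0 and n > 2f")
    return upper_bound(k, f, n) - lower_bound(k, f, n)
```

For n = 2f + 1 both expressions reduce to (2f + 1)k, so the gap is zero, whereas for n = 6, k = 5, f = 2 the lower bound evaluates to 22 and the upper bound to 25.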
Our results suggest a new classification of the data types based on space complexity of fault-
tolerant emulations built from base objects of these types, which is fundamentally different from
those established by [23] and [17]. A promising future direction is to extend this classification with
additional types (e.g., multiple assignment), and potentially generalize it into a full-fledged hierarchy
of its own.
Another possible direction is to consider the time complexity of the emulations. For example, we
showed that although a max-register can be implemented from a single CAS, the time complexity
of the implementation is high. An interesting open question is to determine whether this tradeoff is
inherent.
References
[1] Yehuda Afek, Hagit Attiya, Arie Fouren, Gideon Stupp, and Dan Touitou. Long-lived renaming
made adaptive. PODC ’99, pages 91–103, New York, NY, USA, 1999. ACM.
[2] Marcos K. Aguilera, Burkhard Englert, and Eli Gafni. On using network attached disks as shared
memory. In Proceedings of the Twenty-second Annual Symposium on Principles of Distributed
Computing, PODC ’03, pages 315–324, New York, NY, USA, 2003. ACM.
[3] Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, Jean-Philippe Martin, and Alexander Shraer.
Reconfiguring replicated atomic storage: A tutorial. Bulletin of the EATCS, 102:84–108, 2010.
[4] James Aspnes, Hagit Attiya, and Keren Censor. Max registers, counters, and monotone circuits.
In Proceedings of the 28th ACM symposium on Principles of distributed computing, pages 36–45.
ACM, 2009.
[5] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing
systems. J. ACM, 42(1):124–142, January 1995.
[6] Hagit Attiya and Faith Ellen. Impossibility Results for Distributed Computing, volume 5, chap-
ter 6, pages 1–162. 2014.
[7] Hagit Attiya and Arie Fouren. Algorithms adapting to point contention. J. ACM, 50(4):444–468,
July 2003.
[8] Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations, and
Advanced Topics, chapter 10.4, page 234. Wiley, 2nd edition, 2004.
[9] Cristina Basescu, Christian Cachin, Ittay Eyal, Robert Haas, Alessandro Sorniotti, Marko
Vukolic, and Ido Zachevsky. Robust data sharing with key-value stores. In DSN, pages 1–
12, 2012.
[10] James E. Burns and Nancy A. Lynch. Bounds on shared memory for mutual exclusion. Inf.
Comput., 107(2):171–184, 1993.
[11] Viveck R. Cadambe, Nancy A. Lynch, Muriel Médard, and Peter M. Musial. A coded shared
atomic memory algorithm for message passing architectures. Distributed Computing, 30(1):49–
73, 2017.
[12] Viveck R. Cadambe, Zhiying Wang, and Nancy A. Lynch. Information-theoretic lower bounds
on the storage cost of shared memory emulation. In Proceedings of the 2016 ACM Symposium on
Principles of Distributed Computing, PODC 2016, Chicago, IL, USA, July 25-28, 2016, pages
305–313, 2016.
[13] Gregory Chockler, Rachid Guerraoui, and Idit Keidar. Amnesic distributed storage. In An-
drzej Pelc, editor, Distributed Computing: 21st International Symposium, DISC 2007, Lemesos,
Cyprus, September 24-26, 2007. Proceedings, pages 139–151, Berlin, Heidelberg, 2007. Springer
Berlin Heidelberg.
[14] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohan-
non, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni. Pnuts: Yahoo!’s
hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, August 2008.
[15] Partha Dutta, Rachid Guerraoui, Ron R. Levy, and Marko Vukolic. Fast access to distributed
atomic memory. SIAM Journal on Computing, 39(8):3752–3783, 2010.
[16] Amazon DynamoDB. http://aws.amazon.com/dynamodb/.
[17] Faith Ellen, Rati Gelashvili, Nir Shavit, and Leqi Zhu. A complexity-based hierarchy for mul-
tiprocessor synchronization. arXiv preprint arXiv:1607.06139, 2016.
[18] Burkhard Englert and Alexander A. Shvartsman. Graceful quorum reconfiguration in a robust
emulation of shared memory. In International Conference on Distributed Computing Systems
(ICDCS), pages 454–463, 2000.
[19] Rui Fan and Nancy Lynch. Efficient Replication of Large Data Objects, pages 75–91. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2003.
[20] Rati Gelashvili. On the optimal space complexity of consensus for anonymous processes. In
Yoram Moses, editor, Distributed Computing: 29th International Symposium, DISC 2015,
Tokyo, Japan, October 7-9, 2015, Proceedings, pages 452–466, Berlin, Heidelberg, 2015. Springer
Berlin Heidelberg.
[21] Chryssis Georgiou, Nicolas C. Nicolaou, and Alexander A. Shvartsman. Fault-tolerant semifast
implementations of atomic read/write registers. Journal of Parallel Distributed Computing,
69(1):62–79, 2009.
[22] Seth Gilbert, Nancy A. Lynch, and Alexander A. Shvartsman. Rambo: a robust, reconfigurable
atomic memory service for dynamic networks. Distributed Computing, 23(4):225–272, 2010.
[23] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages
and Systems (TOPLAS), 13(1):124–149, 1991.
[24] Maurice Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent
objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, 1990.
[25] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: wait-free
coordination for internet-scale systems. In USENIX ATC, Berkeley, CA, USA, 2010.
[26] P. Jayanti, T. Chandra, and S. Toueg. Fault-tolerant wait-free shared objects. Journal of the
ACM, 45(3):451–500, 1998.
[27] Leslie Lamport. On interprocess communication. Distributed computing, 1(2):86–101, 1986.
[28] Nancy Lynch. Distributed Algorithms, chapter 17.1.4, pages 580–582. Morgan Kaufmann, 1996.
[29] Dahlia Malkhi and Michael K. Reiter. Byzantine quorum systems. Distributed Computing,
11(4):203–213, 1998.
[30] mongoDB. http://www.mongodb.org/.
[31] Jun Rao, Eugene J. Shekita, and Sandeep Tata. Using paxos to build a scalable, consistent, and
highly available datastore. PVLDB, 4(4):243–254, 2011.
[32] Riak. http://basho.com/riak.
[33] Amazon Simple Storage Service (Amazon S3). http://aws.amazon.com/s3/.
[34] C. Shao, J. L. Welch, E. Pierce, and H. Lee. Multiwriter consistency conditions for shared
memory registers. SIAM Journal on Computing, 40(1), 2011.
[35] Cheng Shao, Jennifer L Welch, Evelyn Pierce, and Hyunyoung Lee. Multiwriter consistency
conditions for shared memory registers. SIAM Journal on Computing, 40(1):28–62, 2011.
[36] Amazon SimpleDB. http://aws.amazon.com/simpledb/.
[37] Alexander Spiegelman, Yuval Cassuto, Gregory V. Chockler, and Idit Keidar. Space bounds for
reliable storage: Fundamental limits of coding. In Proceedings of the 2016 ACM Symposium on
Principles of Distributed Computing, PODC 2016, Chicago, IL, USA, July 25-28, 2016, pages
249–258, 2016.
[38] Microsoft Azure Storage. http://www.windowsazure.com/en-us/manage/services/storage.
[39] Leqi Zhu. A tight space bound for consensus. In Proceedings of the Forty-eighth Annual ACM
Symposium on Theory of Computing, STOC ’16, pages 345–350, New York, NY, USA, 2016.
ACM.
A Formal Model
A.1 Shared Objects
A shared object supports concurrent execution of operations performed by some set, C = {c1, c2, . . . },
of client processes. Each operation has an invocation and response. An object schedule is a sequence
of the operation invocations and their responses. An invoked operation is complete in a given schedule
if the operation’s response is also present in the schedule, and pending otherwise. For a schedule σ,
ops(σ) denotes the set of all operations that were invoked in σ, and complete(σ) (resp., pending(σ))
denotes the subset of ops(σ) consisting of all the complete (resp., pending) operations. Also, for a
set X of operations, we use σ|X to denote the subsequence of σ consisting of all the invocations and
responses of the operations in X.
An operation op precedes an operation op′ in a schedule σ, denoted op ≺σ op′ iff op′ is invoked
after op responds in σ. Operations op and op′ are concurrent in σ, if neither one precedes the other.
A schedule with no concurrent operations is sequential. Given a schedule σ, we use σ|i to denote
the subsequence of σ consisting of the actions of client ci. The schedule is well-formed if each σ|i is a
sequential schedule. In the following, we will only consider well-formed schedules.
The object’s sequential specification is a collection of the object’s sequential schedules in which
all operations are complete. For an object schedule σ, a linearization Lσ of σ is a sequential schedule
consisting of all operations in complete(σ) along with their responses and a subset of pending(σ), each
of which is assigned a matching response, so that Lσ satisfies both σ’s operation precedence
relation (≺σ) and the object’s sequential specification.
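The schedule notions defined above (ops, complete, pending, precedence, concurrency, and sequential schedules) can be illustrated with a small sketch. The encoding of a schedule as a list of ("inv" | "res", operation id) pairs is an assumption made for this example only and is not part of the formal model.

```python
from typing import List, Optional, Set, Tuple

Event = Tuple[str, str]  # ("inv" or "res", operation identifier); assumed encoding


def _position(schedule: List[Event], kind: str, op: str) -> Optional[int]:
    """Index of the given event in the schedule, or None if it is absent."""
    for pos, event in enumerate(schedule):
        if event == (kind, op):
            return pos
    return None


def ops(schedule: List[Event]) -> Set[str]:
    """All operations invoked in the schedule."""
    return {op for kind, op in schedule if kind == "inv"}


def complete(schedule: List[Event]) -> Set[str]:
    """Operations whose response is also present in the schedule."""
    return {op for op in ops(schedule) if _position(schedule, "res", op) is not None}


def pending(schedule: List[Event]) -> Set[str]:
    """Invoked operations that have no response in the schedule."""
    return ops(schedule) - complete(schedule)


def precedes(schedule: List[Event], op: str, other: str) -> bool:
    """True iff `other` is invoked after `op` responds."""
    res = _position(schedule, "res", op)
    inv = _position(schedule, "inv", other)
    return res is not None and inv is not None and res < inv


def concurrent(schedule: List[Event], op: str, other: str) -> bool:
    """True iff neither operation precedes the other."""
    return (
        op != other
        and not precedes(schedule, op, other)
        and not precedes(schedule, other, op)
    )


def is_sequential(schedule: List[Event]) -> bool:
    """A schedule with no concurrent operations is sequential."""
    all_ops = sorted(ops(schedule))
    return not any(concurrent(schedule, a, b) for a in all_ops for b in all_ops)
```

In the schedule inv(w1), inv(r1), res(w1), inv(w2), res(r1), for instance, w1 precedes w2, r1 is concurrent with both writes, and w2 is pending.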
A.2 Registers
A read/write register object (or simply a register) supports two operations of the form write(v),
v ∈ Vals, and read() returning ack and v ∈ Vals respectively where Vals is the register value
domain. Its sequential specification is the collection of all sequential schedules where every read
returns the value written by the last preceding write or an initial value v0 ∈ Vals if no such write
exists.
A register is multi-writer (MW) (resp., multi-reader (MR)) if it can be written (resp., read) by
an unbounded number of clients. A k-writer register, or simply, k-register, is a register that can be
written by at most k > 0 distinct clients. A register is single-writer (SW) (resp., single-reader (SR))
if only one process can write (resp., read) the register. For a register schedule σ, we use writes(σ)
and reads(σ) to denote the sets of all write and read operations invoked in σ. We say that σ is
write-sequential if no two writes in writes(σ) are concurrent.
A.3 Consistency Conditions
Consistency conditions specify the shared object behaviour when accessed concurrently by the clients.
Below, we introduce a number of consistency conditions that will be used throughout this paper.
They are expressed as a set of schedules C satisfying one of the following requirements:
Atomicity [24] For all schedules σ ∈ C, σ has a linearization.
Write-Sequential Regularity (WS-Reg) For all σ ∈ C, if σ is write-sequential, then for each
rd ∈ reads(σ) ∩ complete(σ) there is a linearization Lrd of σ|writes(σ) ∪ {rd}.
Write-Sequential Safety (WS-Safe) As WS-Reg, but only required to hold for complete reads
that are not concurrent with any writes.
A.4 System Model
We consider an asynchronous fault-prone shared memory system [26] consisting of a set of shared base
objects B = {b1, b2, . . . }. The objects are accessed by a collection of clients in the set C = {c1, c2, . . . }.
We consider a slight generalization of the model in [26] where the objects are mapped to a set
S = {s1, s2, . . . } of servers via a function δ from B to S. For B ⊆ B, we will write δ(B) to denote the
image of B, i.e., δ(B) = {δ(b) : b ∈ B}. Conversely, for S ⊆ S, we will write δ−1(S) to denote the
pre-image of S, i.e., δ−1(S) = {b : δ(b) ∈ S}. Note that for all B ⊆ B, |δ(B)| ≤ |B|, and conversely,
for all S ⊆ S, |δ−1(S)| ≥ |S|. Both servers and clients can fail by crashing. A crash of a server causes
all objects mapped to that server to instantaneously crash3.
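The mapping δ and its image and pre-image can be sketched with a dictionary from base objects to servers. The representation and the function names below are assumptions made for illustration; the last helper reflects the rule that a server crash crashes every object mapped to that server.

```python
from typing import Dict, Set


def image(delta: Dict[str, str], objects: Set[str]) -> Set[str]:
    """delta(B): the servers hosting at least one object of B."""
    return {delta[b] for b in objects}


def preimage(delta: Dict[str, str], servers: Set[str]) -> Set[str]:
    """delta^{-1}(S): all objects mapped to some server of S."""
    return {b for b, s in delta.items() if s in servers}


def crashed_objects(delta: Dict[str, str], crashed_servers: Set[str]) -> Set[str]:
    """Objects that crash when the given servers crash."""
    return preimage(delta, crashed_servers)
```

For instance, if b1 and b2 are mapped to s1 and b3 is mapped to s2, then δ({b1, b2}) = {s1} and a crash of s1 crashes both b1 and b2.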
We study algorithms emulating reliable k-writer registers to a set of clients. Clients interact
with the emulated register via high-level read and write operations. To distinguish the high-level
emulated reads and writes from low-level base object invocations, we refer to the former as read and
write. We say that high-level operations are invoked and return whereas low-level operations are
triggered and respond. A high-level operation consists of a series of trigger and respond actions on
base objects, starting with the operation’s invocation and ending with its return. Since base objects
are crash-prone, we assume that the clients can trigger several operations in a row without waiting
for the previously triggered operations to respond.
An emulation algorithm A defines the behaviour of clients as deterministic state machines where
state transitions are associated with actions, such as trigger/response of low-level operations. A
configuration is a mapping to states from system components, i.e., clients and base objects. An
initial configuration is one where all components are in their initial states.
A run of A is a (finite or infinite) sequence of alternating configurations and actions, beginning
with some initial configuration, such that configuration transitions occur according to A. We use the
notion of time t during a run r to refer to the configuration reached after the tth action in r. A run
fragment is a contiguous sub-sequence of a run. A run is write-only if it has no invocations of the
high-level read operations.
We say that a base object, client, or server is faulty in a run r if it fails at some time in r, and
correct, otherwise. A run is fair if (1) for every low-level operation triggered by a correct client
on a correct base object, there is eventually a matching response, and (2) every correct client gets
infinitely many opportunities to both trigger a low-level operation and execute the return actions.
We say that a low-level operation on a base object is pending in run r if it was triggered but has no
matching response in r. We assume that the base objects are atomic (as defined in Section A.3).
A.5 Properties of the Emulation Algorithms
Safety The emulation algorithm safety will be expressed in terms of the consistency conditions
specified in Section A.3. An emulation algorithm A satisfies a consistency condition C if for all A’s
runs r, the subsequence of r consisting of the invocations and responses of the high-level read and
write operations is a schedule in C.
Liveness We consider the following liveness conditions that must be satisfied in fair runs of an
emulation algorithm. A wait-free object is one that guarantees that every high-level operation invoked
by a correct client eventually returns, regardless of the actions of other clients. An obstruction-free
object guarantees that every high-level operation invoked by a correct client that is not concurrent
to any other operation by a correct client eventually returns.
Fault-Tolerance The emulation algorithm is f -tolerant if it remains correct (in the sense of its
safety and liveness properties) as long as at most f servers crash for a fixed f > 0.
Complexity measures The resource consumption of an emulation algorithm A in a (finite) run r
is the number of base objects used by A in r. The resource complexity [26] of A is the maximum
resource consumption of A in all its runs.
3 Note that the original faulty shared model of [26] can be derived from our model by choosing δ to be an injective
function.
B A max-register Emulation With One CAS
We present here a wait-free emulation of an atomic max-register on top of a single CAS object. The
pseudocode appears in Algorithm 1. The CAS object supports one operation with two parameters,
exp and new; if exp is equal to the object’s current value, then the value is set to new. In any case,
the operation returns the old value.
A max-register supports two operations, write-max(v) for some v from some domain of ordered
values V that returns ok, and read-max() that returns a value from V. The sequential specification of
max-register is the following: A read-max returns the highest value among those written by write-max
before it, or v0 if no such value exists.
Algorithm 1 max-register emulation from a single CAS object b
Local variables:
    tmp ∈ V, initially v0
operation b.CAS(exp, new), exp, new ∈ V
    prev ← b
    if exp = b then
        b ← new
    return prev
1: operation write-max(v)
2:     while true do
3:         tmp ← b.CAS(v0, v0)
4:         if tmp ≥ v then
5:             return ok
6:         b.CAS(tmp, v)
7: operation read-max()
8:     tmp ← b.CAS(v0, v0)
9:     return tmp
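Algorithm 1 can be rendered in Python roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the class names `CASObject` and `MaxRegister` are hypothetical, and the atomic CAS base object is modelled with a lock, since Python has no built-in compare-and-swap primitive.

```python
import threading


class CASObject:
    """Models the base object b; a lock makes each cas call atomic (assumption)."""

    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()

    def cas(self, exp, new):
        """Set the value to `new` if it equals `exp`; always return the old value."""
        with self._lock:
            prev = self._value
            if prev == exp:
                self._value = new
            return prev


class MaxRegister:
    """Max-register built from a single CAS object, following Algorithm 1."""

    def __init__(self, v0):
        self._v0 = v0
        self._b = CASObject(v0)

    def write_max(self, v):
        while True:
            # Line 3: CAS(v0, v0) never changes the value, so it acts as a read.
            tmp = self._b.cas(self._v0, self._v0)
            if tmp >= v:
                return "ok"        # lines 4-5: a value at least v is already stored
            self._b.cas(tmp, v)    # line 6: try to install v over the value read

    def read_max(self):
        # Lines 8-9: read the current value via CAS(v0, v0).
        return self._b.cas(self._v0, self._v0)
```

A short usage example: after `reg = MaxRegister(0)`, calling `reg.write_max(3)` and then `reg.write_max(1)` leaves `reg.read_max()` equal to 3, because the second write finds a value that is already at least 1 and returns without modifying b.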
B.1 Correctness
We say that a b.CAS(exp, new) operation is successful if b is set to new.
Observation 1. Consider a successful b.CAS(exp, new) operation for some exp and new. Then the
next b.CAS(exp′, new′) operation for some exp′ and new′ returns new.
The following observation follows immediately from the fact that b.CAS(exp, new) is called only
with new ≥ exp.
Observation 2. The values returned by b.CAS(exp, new) are monotonically increasing.
We next define linearization points:
Definition 4. (linearization points)
read-max: The linearization point is line 8.
write-max: If the operation performs a successful b.CAS(tmp, v) in line 6, then the linearization
point is the last time line 6 is performed. Otherwise, the linearization point is the last time line 3
is performed.
Lemma 5. For every run r of Algorithm 1, the sequential run σr, produced by the linearization
points of operations in r, is a linearization of r.
Proof. The real time order of r is trivially preserved in σr. We need to show that σr preserves max-
register’s sequential specification. Let mr be a read-max() in r that returns a value v, and let tmr
be the time when b.CAS(v0, v0) (that returns v) is called by mr (line 8). We need to show that (1)
there is no write-max(v′) that precedes mr in σr s.t. v′ > v, and (2) if v ≠ v0, there is a write-max(v)
that precedes mr in σr.
1. Assume by way of contradiction that there is a write-max(v′) w that precedes mr in σr s.t.
v′ > v. Denote tw to be the linearization point of w in r, and note that tw < tmr. Now consider
two cases:
(a) w performs line 6 at time tw (line 6 is w’s linearization point). In this case, w performs a
successful b.CAS(tmp, v′) in line 6 for some tmp at some time t′ ≤ tw < tmr. Therefore, by
Observations 1 and 2, b.CAS(v0, v0) called by mr in line 8 at time tmr returns v′′ ≥ v′ > v.
Hence, mr returns v′′ ≥ v′ > v. A contradiction.
(b) w performs line 3 at time tw (line 3 is w’s linearization point). Thus, tw is the last time w
calls tmp ← b.CAS(v0, v0) in line 3. Therefore, b.CAS(v0, v0), called at time tw returns
a value bigger than or equal to v′. Therefore, by Observations 1 and 2, b.CAS(v0, v0)
called by mr in line 8 at time tmr returns v′′ ≥ v′ > v. Hence, mr returns v′′ ≥ v′ > v. A
contradiction.
2. Assume that v ≠ v0. By the code and by the CAS properties, there is a write-max(v) operation
w that calls a successful b.CAS(tmp, v) (line 6) for some tmp at time tw < tmr. Therefore, by
Observations 1 and 2, the next call to b.CAS(v0, v0) (in line 3) by w returns tmp ≥ v, and
thus w does not perform line 6 again. We get that tw is the linearization point of w, and thus
w precedes mr in σr.
Theorem 4. Algorithm 1 emulates wait-free atomic max-register.
Proof. Atomicity follows from Lemma 5. It remains to show wait-freedom:
• read-max(): Since b.CAS(exp, new) is wait-free, read-max() is wait-free.
• write-max(v): Note that write-max(v) returns in the iteration in which b.CAS(v0, v0) in line 3
returns a value bigger than or equal to v. Therefore, by Observations 1 and 2, write-max(v)
returns in the iteration following a successful b.CAS(tmp, v) in line 6. Now assume by
way of contradiction that write-max(v) does not have a successful b.CAS(tmp, v) in line 6.
By the code and Observations 1 and 2, if b.CAS(tmp, v) in line 6 does not succeed, then the
following b.CAS(v0, v0) in line 3 returns a bigger value than what it returned in the previous
iteration. Now let v′ be the value returned by the first b.CAS(v0, v0) in line 3, and assume
w.l.o.g. that there are k values bigger than v′ and smaller than v in V. Therefore, after at
most k iterations write-max(v) returns.
C Space Lower Bounds
Lemma 2 (restated). Let r ∈ 〈ri−1, Adi〉 be a run consisting of t steps. Then, for all t ≥ ti−1, all
of the following hold:
1. Qi(t) ⊆ δ(Covi(t)) \ F
2. Qi(t) ⊆ Qi(t+ 1)
3. Fi(t) ⊆ Fi(t+ 1)
4. |Fi(t)| − |Qi(t)| ≤ 1
5. |Qi(t)| ≤ f
6. |Fi(t)| ≤ f + 1
7. Fi(t) = Fi(t+ 1) =⇒ Mi(t) ⊆Mi(t+ 1)
8. |Mi(t)| ≤ f + 1
9. |δ(Covi(t)) \ F | ≥ f =⇒ |Qi(t)| ≥ f
10. |δ(Covi(t)) \ F | < f =⇒ δ(Rri(t)) \ F = ∅
11. (Qi(t) ∪ Mi(t)) ∩ δ(Rri(t)) = ∅
Proof. By induction on t ≥ ti−1.
Base: If t = ti−1, then Tri(t) = Rri(t) = Covi(t) = Fi(t) = ∅. Furthermore, since |δ(Covi(t)) \ F | =
0 ≤ 1 ≤ f , Qi(t) = δ(Covi(t)) \ F . Thus, all the claims hold.
Induction step: Suppose all the claims hold at time t ≥ ti−1, and consider the time t + 1:
2.1: If |δ(Covi(t+ 1)) \ F | ≤ f , then by Definition 1.4, Qi(t+ 1) = δ(Covi(t+ 1)) \ F as needed.
Otherwise, Qi(t + 1) = Qi(t). Consider an arbitrary server s ∈ Qi(t + 1), and towards a contradiction,
suppose that s ∉ δ(Covi(t + 1)) \ F . Since s ∈ Qi(t + 1) = Qi(t), by the induction hypothesis,
s ∈ δ(Covi(t)) ∧ s ∉ F . Since s ∉ δ(Covi(t + 1)) \ F , we get that either (1) s ∉ δ(Covi(t + 1)) or
(2) s ∈ δ(Covi(t + 1)) ∩ F . Since s ∉ F , (2) is false, and therefore, s ∉ δ(Covi(t + 1)). Hence, there
exists a base register in δ−1({s}) that responded at t to a low-level write w triggered after ti−1. Since
s ∈ Qi(t), by Definition 2, w ∈ BlockedWrites(t). However, since the environment behaves like Adi
after ri−1, by Definition 3.2, w does not respond at t. A contradiction.
2.2: Towards a contradiction, suppose that there exists s ∈ S such that s ∈ Qi(t) ∧ s ∉ Qi(t + 1).
By Definition 1.4, |δ(Covi(t + 1)) \ F | ≤ f as otherwise, Qi(t + 1) = Qi(t) contradicting the assumption.
Thus, Qi(t + 1) = δ(Covi(t + 1)) \ F , and therefore, either (1) s ∉ δ(Covi(t + 1)) or (2) s ∈
δ(Covi(t + 1)) ∩ F . By the induction hypothesis for 2.1, s ∈ δ(Covi(t)) ∧ s ∉ F . Hence, (2) is false,
and it is only left to consider the case s ∉ δ(Covi(t + 1)). Thus, s ∈ δ(Covi(t)) and s ∉ δ(Covi(t + 1)),
which implies that there exists a base register in δ−1({s}) that responded at time t to a low-level write
w invoked after ti−1. Since s ∈ Qi(t), by Definition 2, w ∈ BlockedWrites(t). However, since the
environment behaves like Adi after ri−1, by Definition 3.2, w does not respond at t. A contradiction.
2.3: Let s ∈ Fi(t). By Definition 1.5, there exists a base register b ∈ δ−1({s}) that responded
to a low-level write triggered on b after ti−1 at time t′ such that ti−1 < t′ ≤ t < t + 1. Since
t < t + 1, the b’s response has also occurred before t + 1, and therefore, b ∈ Rri(t + 1). Hence,
b ∈ (δ−1(S) ∩Rri(t+ 1)) = Fi(t+ 1) as needed.
2.4: Toward a contradiction, suppose that |Fi(t+ 1)| − |Qi(t+ 1)| > 1. Since we already proved
that Qi(t) ⊆ Qi(t + 1), |Qi(t)| ≤ |Qi(t + 1)|. In addition, we know that ||Qi(t + 1)| − |Qi(t)|| ≤ 1,
||Fi(t + 1)| − |Fi(t)|| ≤ 1, and by the induction hypothesis |Fi(t)| − |Qi(t)| ≤ 1. Thus, |Fi(t + 1)| −
|Qi(t+ 1)| > 1 implies that (1) |Fi(t)| − |Qi(t)| = 1 (i.e., |Fi(t)| > |Qi(t)|), (2) |Qi(t+ 1)| = |Qi(t)|,
and (3) |Fi(t + 1)| = |Fi(t)| + 1. Since we already proved that Fi(t + 1) ⊇ Fi(t), (3) implies that
there exists s ∈ S such that s ∈ Fi(t+ 1)\Fi(t). Since by Definition 1.5, Fi(t+ 1) ⊆ F , s ∈ F . Thus,
s ∈ F \Fi(t). This means that either no low-level writes have been triggered on registers in δ−1({s})
after ti−1, or there is a register b ∈ δ−1({s}) that responds to a low-level write triggered on b after
ti−1. In the first case, no register on s can respond at time t, and therefore, s ∉ Fi(t + 1), which is a
contradiction. In the second case, we obtain that b satisfies b ∈ Covi(t), b ∈ δ−1({s}), s ∈ F \ Fi(t),
and b responds at t to a covering write w triggered after ti−1. Thus, by Definition 1.6, s ∈Mi(t) and
since |Fi(t)| > |Qi(t)|, by Definition 1.7, s ∈ Gi(t). Thus, by Definition 2, w ∈ BlockedWrites(t),
and since the environment behaves like Adi after ri−1, by Definition 3.2, w does not respond at t. A
contradiction.
2.5: Assume by contradiction that |Qi(t + 1)| > f . Since by 2.1, Qi(t + 1) ⊆ δ(Covi(t + 1)) \ F ,
|δ(Covi(t + 1)) \ F | > f . By Definition 1.4, Qi(t + 1) = Qi(t), and therefore, |Qi(t)| > f . A
contradiction to the inductive assumption.
2.6: By Definition 1.5, Fi(t+ 1) ⊆ F . Since |F | = f + 1, |Fi(t+ 1)| ≤ f + 1.
2.7: Suppose Fi(t) = Fi(t + 1). Consider s ∈ Mi(t), and toward a contradiction, suppose that
s ∉ Mi(t + 1). Since by Definition 1.5, Fi(t) ⊆ F and Fi(t + 1) ⊆ F , F \ Fi(t) = F \ Fi(t + 1).
This together with the fact that s ∈ F \ Fi(t) implies that s ∈ F \ Fi(t + 1) as well. Thus, it must
be the case that s ∈ δ(Covi(t)) ∧ s ∉ δ(Covi(t + 1)). Thus, by Definition 1.3, there exists a base
register b ∈ δ−1({s}) that responds to a low-level write invoked after ti−1 at time t which implies that
b ∈ Rri(t + 1). Hence, by Definition 5, s ∈ Fi(t + 1). However, since s ∈ F \ Fi(t + 1), s ∉ Fi(t + 1).
A contradiction.
2.8: Since F is fixed in advance and |F | = f + 1, we get |Mi(t + 1)| = |δ(Covi(t + 1)) ∩ (F \
Fi(t+ 1))| ≤ |F \ Fi(t+ 1)| ≤ |F | = f + 1.
2.9: If |δ(Covi(t))\F | < f and |δ(Covi(t+1))\F | ≥ f , then there exists a base register on a server
in S\F that is newly covered after t. Thus, we get |δ(Covi(t))\F | = f−1 and |δ(Covi(t+1))\F | = f .
By Definition 1.4, Qi(t + 1) = δ(Covi(t + 1)) \ F , and therefore, |Qi(t + 1)| = f . Otherwise, by
the induction hypothesis, |Qi(t)| ≥ f . Since |δ(Covi(t + 1)) \ F | ≥ f , we have that either (1)
|δ(Covi(t+1))\F | = f , or (2) |δ(Covi(t+1))\F | > f . Applying Definition 1.4 to (1) and (2), we get
the following: for (1), Qi(t+1) = δ(Covi(t+1))\F , which implies |Qi(t+1)| = |δ(Covi(t+1))\F | = f ,
and for (2), Qi(t+ 1) = Qi(t), and therefore, |Qi(t+ 1)| ≥ f .
2.10: Toward a contradiction, suppose that |δ(Covi(t + 1)) \ F | < f and δ(Rri(t + 1)) \ F ≠ ∅.
By the induction hypothesis, |δ(Covi(t)) \ F | < f =⇒ δ(Rri(t)) \ F = ∅. We first consider the case
|δ(Covi(t)) \ F | ≥ f . Thus given |δ(Covi(t + 1)) \ F | < f , there exists a server in δ(Covi(t)) \ F
such that some register on that server responds to a low-level write w that was triggered after ti−1.
Moreover, |δ(Covi(t)) \ F | = f , and thus, by Definition 1.4, Qi(t) = δ(Covi(t)) \ F . Since the
environment behaves like Adi after ri−1, by Definition 2, w ∈ BlockedWrites(t), and therefore, by
Definition 3.2, w does not respond at t. A contradiction.
Thus, we know that |δ(Covi(t)) \ F | < f and δ(Rri(t)) \ F = ∅. And since δ(Rri(t+ 1)) \ F 6= ∅,
there exists a server s ∈ δ(Covi(t))\F such that some object on s responded at t to a low-level write
w triggered after ti−1. Since |δ(Covi(t)) \ F | < f , by Definition 1.4, Qi(t) = δ(Covi(t)) \ F . Thus,
s ∈ Qi(t). However, by Definition 2, w ∈ BlockedWrites(t), and since the environment behaves like
Adi after ti−1, by Definition 3.2, w does not respond at t. A contradiction.
2.11: Toward a contradiction, suppose that (Qi(t + 1) ∪ Mi(t + 1)) ∩ δ(Rri(t + 1)) ≠ ∅. We will
consider the following two cases separately: (1) Qi(t + 1) ∩ δ(Rri(t + 1)) ≠ ∅, and (2) Mi(t + 1) ∩
δ(Rri(t + 1)) ≠ ∅.
(1) Suppose Qi(t + 1) ∩ δ(Rri(t + 1)) ≠ ∅, and let s ∈ Qi(t + 1) ∩ δ(Rri(t + 1)). If s ∈ Qi(t),
then by the induction hypothesis s ∉ δ(Rri(t)). This means that either (a) δ−1({s}) ∩ Tr(t) = ∅,
or (b) there exists a base register in δ−1({s}) that responds to a low-level write w triggered after
ti−1. If (a) holds, then no base register can respond to a low-level write before t + 1, and therefore,
s ∉ δ(Rri(t + 1)), which is a contradiction. If (b) is the case, then since s ∈ Qi(t), by Definition 2,
w ∈ BlockedWrites(t). Since the environment behaves like Adi after ti−1, by Definition 3.2, w does
not respond at t, which is also a contradiction.
If s ∉ Qi(t), then since s ∈ Qi(t + 1), by Definition 1.4, the only action that can follow t is a
trigger of a low-level write on some register in δ−1({s}). Since s ∈ δ(Rri(t + 1)), and the action
executed at t is not a respond, s ∈ δ(Rri(t)). Thus, Qi(t + 1) ≠ Qi(t) which by Definition 1.4, implies
that Qi(t + 1) = δ(Covi(t + 1)) \ F , and |Qi(t + 1)| = |δ(Covi(t + 1)) \ F | ≤ f . Since Qi(t) ⊂ Qi(t + 1),
|Qi(t)| < f , and by the induction hypothesis for 2.9, we have |δ(Covi(t + 1)) \ F | < f . Thus, by the
induction hypothesis for 2.10, we conclude that no registers on servers in S \ F have responded to
any low-level writes triggered between ti−1 and t. However, since s ∈ δ(Rri(t)) and, by the induction
hypothesis for 2.1, s ∉ δ(Rri(t)), s ∈ S \ F has a register that responded to a low-level write triggered
after ti−1. A contradiction.
(2) Suppose that Mi(t + 1) ∩ δ(Rri(t + 1)) ≠ ∅, and let s ∈ Mi(t + 1) ∩ δ(Rri(t + 1)). By
Definition 1.5, we know that s ∈ F \ Fi(t + 1), and therefore, s ∈ F and s ∉ Fi(t + 1). Thus,
s ∈ F ∩ δ(Rri(t + 1)), and therefore, by Definition 1.5, s ∈ Fi(t + 1). A contradiction.
We now give the full proof of Lemma 4. An illustration of the runs constructed in the proof
appears in Figure 2.
Figure 2: An illustration of the runs constructed in Lemma 4.
Lemma 4 (restated). Consider a run r ∈ 〈ri−1, Adi, tr〉 where tr > ti−1, consisting of ri−1 followed
by a complete high-level write invocation by client ci ∉ C(ti−1) that returns at time tr. Then,
|δ(Tri(tr) \ Cov(ti−1))| > 2f .
Proof. Denote X ≜ δ(Tri(tr) \ Cov(ti−1)), and assume by contradiction that |X| ≤ 2f . Let S1 =
Fi(tr), S2 = Qi(tr), S3 = X ∩ (F \ Fi(tr)) and S4 = X \ (S1 ∪ S2 ∪ S3). Note that S1, S2, S3, S4 are
pairwise disjoint, and X = S1 ∪ S2 ∪ S3 ∪ S4.
We first show that |S1 ∪ S4| ≤ f . By Lemma 2.6, |S1| ≤ f + 1. However, if |S1| = f + 1,
then by Lemma 2.4, |S1| − |S2| = f + 1 − |S2| ≤ 1, and therefore, |S1| + |S2| ≥ 2f + 1, violating
the assumption. Hence, |S1| ≤ f . By Lemma 2.5, |S2| ≤ f . If |S2| = f , then by assumption,
|S1 ∪ S3 ∪ S4| = |S1| + |S3| + |S4| ≤ f , and therefore, |S1 ∪ S4| = |S1| + |S4| ≤ f . And if |S2| < f ,
then by Definitions 1.4 and 3, |S4| = 0. Hence, |S1 ∪ S4| = |S1| + |S4| ≤ f .
Now let r′ be a fair extension of ri−1 consisting of t′c steps in which ri−1 is followed by (1) the
crash events of all servers in S1 ∪ S4, and (2) the respond steps of all the covering writes in ri−1
(and no other steps). Extend r′ with an invocation of a high-level read operation R by client crd ≠ ci
at time t′c. Since |S1 ∪ S4| ≤ f , by obstruction freedom and f -tolerance, there exists a time trd > ti−1
at which R returns. Since r′ is write-sequential, by WS-Safety, R must return vi−1.
Next, let r′′ be an extension of r consisting of all steps in r up to the time tr followed by (1) the
crash events of all servers in the set S1 ∪ S4, and (2) the respond steps of all covering writes in ri−1
(and no other steps). Let t′′c > tr be the number of steps in r′′. By Assumption 1, the values that
can be read from the base registers in Cov(ti−1) at time t′′c in r′′ are identical to those that can be
read at time t′c in r′. Furthermore, by Definitions 1.5 and 3, low-level writes triggered on registers
in δ−1(S2 ∪ S3) do not respond before tr in r. Thus, by Assumption 1, the values that can be read
from the base registers in δ−1(S2 ∪S3) at time t′c in r′ are also the same as those that can be read at
time t′′c in r′′. Thus, all registers in non-faulty servers at time t′c in r′ will appear to the subsequent
reads as having the same content as at the time t′′c in r′′.
We now extend r′′ by letting client crd invoke high-level read R at time t′′c . Since r′ is indistin-
guishable from r′′ to crd, and R has no concurrent high-level operations, we get that R returns vi−1
in r′′. However, since Wi is the last complete write preceding R in r′′, by WS-Safety, R’s return
value must be vi ≠ vi−1. A contradiction.
Theorem 2 (restated). [Resource Complexity of k-max-register] Any algorithm implementing a wait-
free k-writer max-register from a collection of wait-free MWMR atomic base registers uses at least
k > 0 base registers.
Proof. Suppose to the contrary that there exists an algorithm A implementing a k-writer max-register
using ℓ < k base MWMR wait-free atomic registers. Consider a fault-prone shared memory system
consisting of n = 2f + 1 servers each of which stores ℓ MWMR wait-free atomic registers. Run n
copies A1, . . . , An of A, one on each server, to obtain n = 2f + 1 copies of a k-writer max-register. Run
a generic protocol of [34] to obtain an f -tolerant emulation A′ of a wait-free k-writer regular register.
By assumption, the resource complexity of A′ is (2f + 1)ℓ < (2f + 1)k base registers. However, by
Theorem 1, for n = 2f + 1, it must be at least kf + k(f + 1) = (2f + 1)k. A contradiction.
We next prove an extended version of Lemma 1 and use it to show additional lower bounds (see
Theorems 5, 6, 7, and 8).
Lemma 1 (extended). For all k > 0, f > 0, let A be an f -tolerant algorithm that emulates a
WS-Safe obstruction-free k-register using a collection S of servers storing a collection B of wait-free
MWMR atomic registers. Then, for every F ⊆ S such that |F | = f + 1, there exist k + 1 failure-free
runs ri, 0 ≤ i ≤ k, of A such that (1) r0 is a run consisting of an initial configuration and t0 = 0
steps, and (2) for all i ∈ [k], ri is a write-only sequential extension of ri−1 ending at time ti > 0
that consists of i complete high-level writes of i distinct values v1, . . . , vi by i distinct clients c1, . . . , ci
such that:
a) |Cov(ti)| ≥ if
b) δ(Cov(ti)) ∩ F = ∅
c) |δ(Tr(ti) \ Cov(ti−1))| > 2f
d) |δ( Cov(ti) \ Cov(ti−1) )| ≥ f
e) Cov(ti) ⊇ Cov(ti−1)
Proof of c) – e). Fix arbitrary k > 0, f > 0, and a set F of servers such that |F | = f + 1. We
proceed by induction on i, 0 ≤ i ≤ k. Base: Trivially holds for the run r0 of A consisting of t0 = 0
steps. Step: Assume that ri−1 exists for all i ∈ [k − 1]. We use the extension r′ constructed in the
proof of Lemma 1 in Section 3 to show the implications c) – e) of the extended version are true:
c) |δ(Tr(t′) \ Cov(ti−1))| > 2f : Follows immediately from Lemma 4 and Definition 1.1.
d) |δ( Cov(t′) \ Cov(ti−1) )| ≥ f : Since by Corollary 2, |Qi(tr)| = f , and by Lemma 2.2,
Qi(tr) ⊆ Qi(t′), we get |Qi(t′)| = f . Hence, by Definition 1.4, |δ(Covi(t′))| ≥ |Qi(t′)| ≥ f , and
by Definition 1.3, |δ( Cov(t′) \ Cov(ti−1) )| = |δ(Covi(t′))| ≥ f .
e) Cov(ti) ⊇ Cov(ti−1): Follows immediately from Definition 3.
Theorem 5. For all k > 0, and f > 0, let A be an f -tolerant algorithm emulating a WS-Safe
obstruction-free k-register using a collection S of servers. Then, |S| ≥ 2f + 1.
Proof. By Lemma 1.c), there exists a run r1 of A consisting of t1 steps such that |δ(Tr(t1)\Cov(0))| >
2f . Therefore, |S| ≥ |δ(Tr(t1))| ≥ 2f + 1.
Number of registers per server. Theorem 1 implies that in case |S| = 2f +1, at least (2f +1)k
registers are required. The following theorem further refines this result by showing that in this case,
each server must store at least k registers:
Theorem 6. Let |S| = 2f + 1. For all k > 0, f > 0, every f -tolerant algorithm emulating an
obstruction-free WS-Safe k-register stores at least k registers on each server in S.
Proof. Pick an arbitrary f > 0, k > 0, and suppose toward a contradiction that there is an f -tolerant
algorithm A emulating an obstruction-free WS-Safe k-register that stores fewer than k registers on some
server s ∈ S (i.e., |δ−1({s})| < k).
Pick an arbitrary set F ⊂ S of size |F | = f + 1 such that s ∉ F . By Lemma 1, there exists
a sequential write-only run rk consisting of k high-level write invocations W1, . . . ,Wk by k distinct
clients, and k distinct times t1 < · · · < tk such that: |δ(Cov(t1))| ≥ f and δ(Cov(t1)) ∩ F = ∅;
and for all i ∈ [k] \ {1}, |δ(Cov(ti) \ Cov(ti−1))| ≥ f , Cov(ti) ⊇ Cov(ti−1), and δ(Cov(ti)) ∩ F = ∅.
By induction on i ∈ [k], it is easy to see that all sets in the collection consisting of Cov(t1) and
Cov(ti) \ Cov(ti−1) where i ∈ [k] \ {1} are pairwise disjoint. Thus, at least f new registers become
covered at each ti, i ∈ [k]. Moreover, since no registers on the servers in F are covered at ti, all
registers that become covered at ti must be located on the servers in S \ F . Since |S \ F | = f and the
registers newly covered at each ti span at least f servers, every server in S \ F , and in particular
s ∈ S \ F , gains at least one newly covered register at each ti. We conclude that at least k distinct
registers on s must be covered at time tk, that is,
|δ−1({s}) ∩ Cov(tk)| ≥ k. Therefore, |δ−1({s})| ≥ k. A contradiction.
Servers with bounded storage. The following result provides a lower bound on the number of
servers for the case where the storage available on each server is bounded (as is often the case in practice)
by a known constant:
Theorem 7. Let m > 0 be an upper bound on the number of registers mapped to each server in S
(i.e., ∀s ∈ S, |δ−1({s})| ≤ m). For all f > 0 and k > 0, every f -tolerant algorithm emulating an
obstruction-free WS-Safe k-register from a collection S of servers such that |S| > 2f + 1 uses at least
⌈kf/m⌉ + f + 1 servers (i.e., |S| ≥ ⌈kf/m⌉ + f + 1).
Proof. Fix F ⊂ S such that |F | = f + 1. By Lemma 1(c), there exists an extension rk of rk−1 ending
at time tk such that |δ(Tr(tk) \ Cov(tk−1))| > 2f . Since |F | = f + 1, |δ(Tr(tk) \ Cov(tk−1)) ∩ F | ≤ f + 1.
Hence, we obtain
|δ(Tr(tk) \ Cov(tk−1)) \ F | =
|δ(Tr(tk) \ Cov(tk−1))| − |δ(Tr(tk) \ Cov(tk−1)) ∩ F | ≥ 2f + 1− f − 1 = f
Thus,
|(Tr(tk) \ Cov(tk−1)) \ δ−1(F )| ≥ |δ(Tr(tk) \ Cov(tk−1)) \ F | ≥ f (1)
On the other hand,
|(Tr(tk) \ Cov(tk−1)) \ δ−1(F )| = |(Tr(tk) \ Cov(tk−1)) ∩ δ−1(S \ F )| =
|(Tr(tk) ∩ δ−1(S \ F )) \ Cov(tk−1)| ≤ |δ−1(S \ F ) \ Cov(tk−1)| (2)
Since by Lemma 1(a) and (b), |Cov(tk−1)| ≥ (k − 1)f and δ−1(S \ F ) ⊇ Cov(tk−1), we get
|δ−1(S \ F ) \ Cov(tk−1)| = |δ−1(S \ F )| − |Cov(tk−1)| ≤ (|S \ F |)m− (k − 1)f (3)
Combining (3) with (1) and (2), we get
(|S \ F |)m− (k − 1)f ≥ |(Tr(tk) \ Cov(tk−1)) \ δ−1(F )| ≥ f
Since S ⊃ F and |F | = f + 1, we obtain (|S \ F |)m− (k − 1)f = |S|m− (f + 1)m− (k − 1)f ≥ f .
Therefore, |S|m ≥ fm + m + kf , which implies that |S| ≥ kf/m + f + 1. Since |S| is an integer, we
conclude that |S| ≥ ⌈kf/m⌉ + f + 1.
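As a quick numeric illustration of the bound in Theorem 7, the following Python sketch evaluates ⌈kf/m⌉ + f + 1 for a few parameter choices. The function name is hypothetical and chosen here only for illustration; the code computes the bound and is not part of the emulation.

```python
def min_servers_bounded_storage(k: int, f: int, m: int) -> int:
    """Evaluate the Theorem 7 lower bound ceil(k*f/m) + f + 1 on |S|.

    k: number of writers, f: failure threshold,
    m: maximum number of registers stored on any single server.
    """
    if k <= 0 or f <= 0 or m <= 0:
        raise ValueError("k, f and m must be positive")
    ceil_kf_over_m = -((-k * f) // m)  # integer ceiling division
    return ceil_kf_over_m + f + 1


if __name__ == "__main__":
    # Hypothetical parameter choices, for illustration only.
    for k, f, m in [(4, 1, 1), (4, 1, 2), (10, 2, 3), (5, 2, 5)]:
        print(f"k={k} f={f} m={m} -> |S| >= {min_servers_bounded_storage(k, f, m)}")
```

Note that for m = k the expression evaluates to 2f + 1, which agrees with the setting of Theorem 6, where each of 2f + 1 servers stores at least k registers.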
Adaptivity to contention. Given a run fragment r of an emulation algorithm, the point contention
[1, 7] of r, PntCont(r), is the maximum number of clients that have an incomplete high-level
invocation after some finite prefix of r. Similarly, we use PntCont(op) to denote PntCont(rop), where
rop is the run fragment including all events between the op’s invocation and response.
The resource complexity of A is adaptive to point contention if there exists a function M such
that after all finite runs r of A, the resource consumption of A in r is bounded by M(PntCont(r)).
Likewise, the time complexity of A is adaptive to point contention if there exists a function T such
that for each client ci, and operation op, the time to complete the invocation of op by ci is bounded
by T (PntCont(op)).
We show that no WS-Safe obstruction-free MWSR register can have a fault-tolerant emulation adap-
tive to point contention:
Theorem 8. For any f > 0, there is no f -tolerant algorithm that emulates a WS-Safe obstruction-free
k-register with resource complexity adaptive to point contention.
Proof. By Lemma 1, there exists a run r of A consisting of k high-level writes by k distinct clients
such that the resource complexity grows by f for each consecutive write that completes in r whereas
the point contention remains equal to 1 for the entire r. We conclude that no function mapping point
contention to resource consumption can exist, and therefore, A’s resource complexity is not adaptive
to point contention.
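The growth argument can be made concrete with a small Python sketch. It assumes, as in Lemma 1(a), that after i sequential writes by distinct clients at least i·f registers are covered, and that enough writers are available; the function names are hypothetical and introduced only for this illustration.

```python
def covered_lower_bound(i: int, f: int) -> int:
    """Lemma 1(a): after i sequential writes, |Cov(t_i)| >= i * f."""
    return i * f


def writes_to_exceed(candidate_bound: int, f: int) -> int:
    """Smallest number of sequential writes whose covered-register
    lower bound exceeds a candidate value M(1) for point contention 1."""
    if candidate_bound < 0 or f <= 0:
        raise ValueError("need candidate_bound >= 0 and f > 0")
    return candidate_bound // f + 1


if __name__ == "__main__":
    f = 3
    for candidate in (0, 9, 10, 100):
        i = writes_to_exceed(candidate, f)
        print(f"M(1)={candidate}: {i} writes cover >= {covered_lower_bound(i, f)} registers")
```

Whatever value is proposed for M(1), a write-sequential run with sufficiently many writers exceeds it while the point contention stays at 1.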
D Upper Bound Algorithm
In this section, we will give a detailed description of our upper bound construction discussed in
Section 3.3 along with the correctness proof.
Algorithm 2 ∀f > 0, ∀k > 0, ∀n = |S| ≥ 2f + 1.
Types:
TSVal = N × V, with selectors ts and val.
States = TSVal × 2^TSVal × 2^B × 2^B with selectors tsVal, rdSet, wrSet and coverSet.
Base Objects and Servers:
∀b ∈ δ−1(S), b ∈ TSVal, initially, 〈0, v0〉.
Let z ≜ ⌊(n − (f + 1))/f⌋, y ≜ zf + f + 1, and m ≜ ⌈k/z⌉.
R = {R0, . . . , Rm−1} ⊂ 2^{δ−1(S)} s.t.
1. ∀i ∈ {0, . . . ,m−2}, |Ri| = y. If ⌈k/z⌉ = ⌊k/z⌋, then |Rm−1| = y. Else, |Rm−1| = (k − ⌊k/z⌋z)f + f + 1.
2. ∀Ri, Rj ∈ R with i ≠ j, Ri ∩ Rj = ∅.
3. ∀Ri ∈ R, |δ(Ri)| = |Ri|.
Client states:
∀i ∈ [k], Statei ∈ States, initially, 〈〈〈0, 1〉, v0〉, ∅, Rj , ∅〉, where j = ⌊i/z⌋.
Code for client ci, 1 ≤ i ≤ k:
 1: operation write(v)
 2:   value ← collect()
 3:   Statei.tsVal.val ← v
 4:   Statei.tsVal.ts ← value.ts + 1
 5:   j ← ⌊i/z⌋
      ▷ do not handle responses between lines 6 and 10
 6:   Statei.coverSet ← Rj \ Statei.wrSet
 7:   Statei.wrSet ← ∅
 8:   || for each b ∈ Rj
 9:     if b ∉ Statei.coverSet
10:       trigger b.write(Statei.tsVal)
11:   wait until |Statei.wrSet| ≥ |Rj | − f
12:   return ack
13: scan(s)
14:   for each b ∈ δ−1(s) do
15:     trigger b.read()
16:     wait for the matching response
17: operation read()
18:   value ← collect()
19:   return value.val
20: collect()
21:   Statei.rdSet ← ∅
22:   || for each s ∈ S do
23:     scan(s)
24:   wait for n − f scans to complete
25:   ts ← max({ts′ | 〈ts′, ∗〉 ∈ Statei.rdSet})
26:   return 〈ts′, v〉 ∈ Statei.rdSet : ts′ = ts
27: upon receiving b.read() respond res do
28:   Statei.rdSet ← Statei.rdSet ∪ {res}
29: upon receiving b.write(∗) respond do
30:   if b ∈ Statei.coverSet then
31:     Statei.coverSet ← Statei.coverSet \ {b}
32:     trigger b.write(Statei.tsVal)
33:   else
34:     Statei.wrSet ← Statei.wrSet ∪ {b}
The registers store values in V each of which is associated with a unique timestamp. (Note that
since safety is required only in write-sequential runs, we do not need to break ties with clients’ ids.)
To write a value v to the emulated register, a client ci first accesses a read quorum (via collect() in
lines 22–26) and selects a new timestamp ts which is higher than any other timestamp that has been
returned. It then proceeds to trigger low-level writes of 〈ts, v〉 on registers in Rj , where j = ⌊i/z⌋, so as to
ensure that (1) 〈ts, v〉 is stored in a write quorum wq (lines 8–11), and (2) no more than f registers
in Rj are left covered by ci’s writes (the current and the previous operations). The latter is achieved
by preventing ci from triggering a new low-level write on every register that has not yet responded
to the previously triggered one (lines 9–10). To read a value, a client simply reads all registers in a
read quorum, via collect(), and returns the value having the highest timestamp.
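The client-side bookkeeping described above can be sketched in Python. This is only an illustration under simplifying assumptions: it models the timestamp selection and the split of Rj into covered and writable registers as pure functions, and it does not model asynchrony, servers, or failures. All function and variable names are hypothetical.

```python
from typing import Hashable, Iterable, Set, Tuple

TsVal = Tuple[int, str]  # (timestamp, value), mirroring TSVal = N x V


def highest(read_set: Iterable[TsVal]) -> TsVal:
    """Pick the pair with the highest timestamp, as collect() does
    once n - f scans have completed (read_set must be non-empty)."""
    return max(read_set, key=lambda pair: pair[0])


def next_ts_val(read_set: Iterable[TsVal], value: str) -> TsVal:
    """Choose a timestamp above every collected timestamp for a new write."""
    ts, _ = highest(read_set)
    return (ts + 1, value)


def split_targets(r_j: Set[Hashable], wr_set: Set[Hashable]):
    """Split R_j into (cover_set, writable): registers whose previous
    low-level write has not responded yet, and registers that may be
    written immediately by the current write."""
    cover_set = r_j - wr_set
    return cover_set, r_j - cover_set


if __name__ == "__main__":
    collected = [(0, "v0"), (2, "b"), (1, "a")]
    print(highest(collected))           # pair a read would return
    print(next_ts_val(collected, "c"))  # pair a new write would store
    print(split_targets({1, 2, 3, 4, 5}, {1, 2, 3}))
```

In this sketch a register in `cover_set` is skipped by the current write, matching the rule that a client does not trigger a new low-level write on a register that has not yet responded to the previous one.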
The space complexity of the algorithm is Σ_{Ri∈R} |Ri| = ⌊k/z⌋y + (k − ⌊k/z⌋z)f + (f + 1)(⌈k/z⌉ − ⌊k/z⌋) =
· · · = kf + ⌈k/z⌉(f + 1) = kf + ⌈k/⌊(n − (f + 1))/f⌋⌉(f + 1) registers. Below, we prove that the algorithm satisfies
wait-freedom and write-sequential regularity. The following observation follows from the code and the
construction of the sets in R: (1) writers never trigger low-level writes on base objects with pending
low-level writes from previous writes, (2) writers wait for n − f base objects to reply (line 24), and
(3) for every set Ri ∈ R, the number of clients that write to registers in Ri is
⌊(|Ri| − (f + 1))/f⌋ = (|Ri| − (f + 1))/f .
Observation 3. For every 0 < i ≤ k and every time t in a run r, if writer ci has no pending write
at t then it covers at most f base objects at time t.
Lemma 6. Consider a write-sequential run r of the algorithm, and consider two sequential writes
Wi,Wj in r s.t. Wi precedes Wj. Then Wj’s value is associated with a bigger timestamp than Wi’s
value.
Proof. Since Wi precedes Wj , Wj starts the collect in line 2 after Wi returns. Wi triggers low-level
writes with its value and timestamp on base objects in Rl (l = ⌊i/z⌋) that are not covered by its
previous writes, and waits for |Rl| − f low-level writes to respond (line 11) before it returns. Thus,
since |δ(Rl)| = |Rl|, Wj starts its collect after Wi writes its timestamp to at least |Rl| − f base
objects in different servers, none of which is covered by a low-level write of the previous writes by Wi’s client.
Moreover, since the number of writers excluding the writer of Wi that write to base objects in Rl is
⌊(|Rl| − (f + 1))/f⌋ − 1 = (|Rl| − (f + 1))/f − 1, readers do not write, and each writer covers at most f base objects, we get that
at least f + 1 servers have a base object that stores Wi’s timestamp when Wj begins its collect. Now
since collect reads all base objects in at least n − f servers (line 24), Wj sees Wi’s timestamp and
picks a bigger one (line 4).
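The counting step behind the f + 1 bound can be spelled out. Writing w for the number of writers assigned to Rl (a symbol introduced here only for brevity, so that |Rl| = wf + f + 1 by the construction of R), a sketch of the arithmetic is:

```latex
\begin{align*}
\underbrace{|R_l| - f}_{\text{responses awaited by } W_i}
  \;-\; \underbrace{(w - 1)\,f}_{\text{covered by the other writers}}
  &= (wf + f + 1 - f) - (w - 1)f \\
  &= f + 1 .
\end{align*}
```

The same computation applies to a full set (w = z) and to the last, possibly smaller, set (w = k − ⌊k/z⌋z).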
Corollary 3. Consider a write-sequential run r of the algorithm. If write Wi precedes write Wj,
then Wj is associated with a bigger timestamp than Wi.
Lemma 7. Consider a write-sequential run r of the algorithm, and a read operation rd and a
write W in r. Let ts be the timestamp associated with W . If W precedes rd, then rd returns a
value associated with timestamp ts′ ≥ ts.
Proof. Let t be the time when W returns, and assume w.l.o.g. that W is performed by client ci s.t.
⌊i/z⌋ = j. Before W returns, ci triggers low-level writes with its value and timestamp on base
objects in Rj that are not covered by its previous writes, and waits for |Rj | − f low-level writes
to respond. The number of clients excluding ci that trigger low-level writes on base objects in Rj is
⌊(|Rj | − (f + 1))/f⌋ − 1 = (|Rj | − (f + 1))/f − 1, and by Observation 3, each client covers at most f base objects
at time t. By Corollary 3 and since readers do not write, every low-level write in r that is triggered
after time t is associated with a bigger timestamp than ts. Therefore, since |δ(Rj)| = |Rj |, there is
a set of f + 1 base objects, each of which is mapped to a different server, s.t. at any time t′ ≥ t the
timestamp each of them stores is bigger than or equal to ts.
Since W precedes rd, rd starts the collect after time t. And since collect reads all base objects in
at least n− f servers, rd sees at least one value associated with timestamp bigger than or equal to
ts, and thus, returns a value associated with timestamp ts′ ≥ ts.
Definition 5. For every write-sequential run r, for every read rd in r that returns a value associated
with timestamp ts we define the sequential run σrrd as follows: All the completed write operations in
r are ordered in σrrd by their timestamp, and rd is added after the write operation that is associated
with ts.
In order to show that the algorithm simulates a write-sequential regular register we need to prove
that for every write-sequential run r, for every read rd, σrrd preserves the real time order of r and
the sequential specification. Note that the sequential specification is satisfied by construction, and
we prove the real time order in the next lemma.
Lemma 8. For every write-sequential run r, for every complete read rd that returns a value associ-
ated with timestamp ts in r, σrrd preserves r’s operation precedence relation (real time order).
Proof. By Corollary 3, the real time order of r between every two write operations is preserved in
σrrd . It remains to show that the real time order of r between rd and any write W in σrrd is preserved.
Consider two cases:
• W precedes rd in r. By Lemma 7, W is associated with a timestamp smaller than or equal to
ts, and thus, by construction of σrrd the real time order between rd and W is preserved.
• rd precedes W in r. Let Wts be the write operation associated with timestamp ts. Since
rd returns a value associated with timestamp ts, Wts starts before rd completes, and since
r is write-sequential, Wts precedes W in r. Thus, by Lemma 6, W is associated with a bigger
timestamp than ts. Therefore, by construction of σrrd the real time order between rd and W
is preserved.
Corollary 4. For every write-sequential run r, for every complete read rd in r, there is a linearization
of rd and all the write operations in r.
Theorem 3 (restated). For all k > 0, f > 0, and n > 2f , there exists an f -tolerant algorithm
emulating a wait-free WS-Regular k-register using a collection of n servers storing kf + ⌈k/z⌉(f + 1)
wait-free z-writer/multi-reader atomic base registers where z = ⌊(n − (f + 1))/f⌋.
Proof. By Corollary 4, the code in Algorithm 2 satisfies WS-regularity. Now notice that in both
write and read operations clients never wait for more than n − f servers to respond, and thus,
wait-freedom follows. We conclude that Algorithm 2 satisfies the theorem.
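The register count in the theorem can be cross-checked numerically against the set sizes prescribed for R in Algorithm 2. The Python sketch below is an illustration only: it builds the list of sizes |R0|, . . . , |Rm−1| and compares their sum with kf + ⌈k/z⌉(f + 1). The function names are hypothetical.

```python
from typing import List


def register_set_sizes(k: int, f: int, n: int) -> List[int]:
    """Sizes |R_0|, ..., |R_{m-1}| as laid out in Algorithm 2."""
    if k <= 0 or f <= 0 or n < 2 * f + 1:
        raise ValueError("need k > 0, f > 0 and n >= 2f + 1")
    z = (n - (f + 1)) // f       # writers sharing one set (z >= 1)
    y = z * f + f + 1            # size of a full set
    full, rest = divmod(k, z)    # rest = k - floor(k/z) * z
    sizes = [y] * full
    if rest:                     # ceil(k/z) != floor(k/z)
        sizes.append(rest * f + f + 1)
    return sizes


def closed_form(k: int, f: int, n: int) -> int:
    """k*f + ceil(k/z)*(f + 1), the count stated in Theorem 3."""
    z = (n - (f + 1)) // f
    ceil_k_over_z = -((-k) // z)  # integer ceiling division
    return k * f + ceil_k_over_z * (f + 1)


if __name__ == "__main__":
    # Exhaustive check over a small, arbitrarily chosen parameter range.
    for f in range(1, 5):
        for n in range(2 * f + 1, 4 * f + 4):
            for k in range(1, 12):
                assert sum(register_set_sizes(k, f, n)) == closed_form(k, f, n)
    print(register_set_sizes(5, 2, 7), closed_form(5, 2, 7))
```

For n = 2f + 1 the sketch gives z = 1 and a total of (2f + 1)k registers, which coincides with the lower bound used in the proof of Theorem 2.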
