On Atomic Registers and Randomized Consensus in M&M Systems by Hadzilacos, Vassos et al.
On Atomic Registers and Randomized Consensus
in M&M Systems
Vassos Hadzilacos Xing Hu Sam Toueg
Department of Computer Science
University of Toronto
Canada
June 19, 2020
Abstract
Motivated by recent distributed systems technology, Aguilera et al. introduced a hybrid model of
distributed computing, called message-and-memory model or m&m model for short [1]. In this model,
processes can communicate by message passing and also by accessing some shared memory. We first
consider the basic problem of implementing an atomic single-writer multi-reader (SWMR) register
shared by all the processes in m&m systems. Specifically, we give an algorithm that implements
such a register in m&m systems and show that it is optimal in the number of process crashes that it
can tolerate. This generalizes the well-known implementation of an atomic SWMR register in a pure
message-passing system [5]. We then combine our register implementation for m&m systems with
the well-known randomized consensus algorithm of Aspnes and Herlihy [4], and obtain a randomized
consensus algorithm for m&m systems that is also optimal in the number of process crashes that it
can tolerate.
1 Introduction
Motivated by recent distributed systems technology [11, 18, 19, 25, 29], Aguilera et al. introduced a
hybrid model of distributed computing, called message-and-memory model or m&m model for short [1].
In this model processes can communicate by message passing and also by accessing some shared memory.
Since it is impractical to share memory among all processes in large distributed systems [9, 20, 21, 31],
the m&m model allows us to specify which subsets of processes share which sets of registers. Among
other results, Aguilera et al. show that it is possible to leverage the advantages of the two communication
mechanisms (message-passing and shared-memory) to improve the fault-tolerance of randomized consensus
algorithms compared to a pure message-passing system.
In this paper, we first consider the basic problem of implementing an atomic single-writer multi-reader
(SWMR) register shared by all the processes in m&m systems, and we give an algorithm that is optimal
in the number of process crashes that it can tolerate. This generalizes the well-known implementation
of an atomic SWMR register in a pure message-passing system [5]. We then combine our register
implementation for m&m systems with the randomized consensus algorithm of Aspnes and Herlihy [4],
and obtain a randomized consensus algorithm for m&m systems that is also optimal in the number of
process crashes that it can tolerate. We now describe our results in more detail.
A general m&m system SL is specified by a set of n asynchronous processes that can send messages
to each other over asynchronous reliable links, and by a collection L = {S1, S2, . . . , Sm} where each Si
is a subset of processes: for each Si, there is a set of atomic registers that can be shared by processes in
Si and only by them. Even though the m&m model allows the collection L to be arbitrary, in practice
hardware technology imposes a structure on L [9, 20]: for processes to share memory, they must establish
a connection between them (e.g., an RDMA connection). These connections are naturally modelled by an
undirected shared-memory graph G whose nodes are the processes and whose edges are shared-memory
connections [1]. Such a graph G defines what Aguilera et al. call a uniform m&m system SG, where each
process has atomic registers that it can share with its neighbours in G (and only with them). Note that
SG is the instance of the general m&m system SL with L = {S1, S2, . . . , Sn} where each Si consists of a
process and its neighbours in G. Furthermore, if G is the trivial graph with n nodes but no edges, the
m&m system SG that G induces is just a pure message-passing asynchronous system with n processes.
We consider the implementation of an atomic SWMR register R, shared by all the processes, in both
general and uniform m&m systems. For each general m&m system SL, we determine the maximum
number of crashes tL for which it is possible to implement R in SL: we give an algorithm that tolerates
tL crashes and prove that no algorithm can tolerate more than tL crashes. Similarly, for each shared-
memory graph G and its corresponding uniform m&m system SG, we use the topology of G to determine
the maximum number of crashes tG for which it is possible to implement R in SG. By specifying tG in
terms of the topology of G, one can leverage results from graph theory to design m&m systems that can
implement R with high fault tolerance and relatively few RDMA connections per process. For example,
it allows us to design an m&m system with 50 processes that can implement a wait-free R (i.e., this
implementation can tolerate any number of process crashes) with only 7 RDMA connections per process;
as explained in Section 4, this is optimal in some precise sense.
We then show how to solve randomized consensus in m&m systems by substituting the atomic SWMR
registers used by the randomized consensus algorithm of [4] with our implementation of such registers
for m&m systems. Note that, in general, replacing the atomic registers of a randomized algorithm with
(linearizable) implementations of such registers may result in an algorithm that does not work against a
strong adversary [13]. We use a recent result in [14]1 to show that our randomized consensus algorithm
for m&m systems does work against a strong adversary. Finally, we note that our algorithm tolerates
more failures than the one given in [1], and that, in fact, it is optimal in the number of process crashes
that it can tolerate.
An important remark is now in order. In this paper we consider RDMA systems where process crashes
do not affect the accessibility of the shared registers of that system. This is the case in systems where
the CPU, the DRAM (main memory), and the NIC (Network Interface Controller) are separate entities:
for example, in the InfiniBand cluster evaluated in [28], the crash of a CPU, and of the processes that
it hosts, does not prevent other processes from accessing its DRAM because it can use the NIC without
involving the CPU; see also [9, 12, 33].
2 Model outline
We consider m&m systems with a set of n asynchronous processes Π = {p1, p2, . . . , pn} that may crash.
To define these systems, we first recall the definition of atomic SWMR registers, and what it means to
implement such registers.
2.1 Atomic SWMR registers
A register R is atomic if its read and write operations are instantaneous (i.e., indivisible); each read
must return the value of the last write that precedes it, or the initial value of R if no such write exists.
A SWMR register R is shared by a set S ⊆ Π of processes if it can be written (sequentially) by exactly
one process w ∈ S and can be read by all processes in S; we say that w is the writer of R [24].
2.2 Implementation of atomic SWMR registers
As in [5], we are interested in implementing atomic SWMR registers. By implementation, we mean a
linearizable implementation of such registers, as we now explain. In an object implementation, each
operation spans an interval that starts with an invocation and terminates with a response.
Definition 1. Let o and o′ be any two operations.
• o precedes o′ if the response of o occurs before the invocation of o′.
• o is concurrent with o′ if neither precedes the other.
Roughly speaking, an object implementation is linearizable [15] if, although operations can be concurrent,
operations behave as if they occur in a sequential order (called “linearization order”) that is
consistent with the order in which operations actually occur: if an operation o precedes an operation o′,
then o is before o′ in the linearization order (the precise definition is given in [15]).
1In that paper, it is proved that the algorithm in [4] works against a strong adversary even with regular registers [24].
It is well-known that linearizable implementations of atomic SWMR registers are characterized by
two simple properties. To define these properties, assume, without loss of generality, that the values
successively written by the single writer w of a SWMR register R are distinct, and different from the
initial value of R.2 Let v0 denote the initial value of R, and vk denote the value written by the k-th write
operation of w. We say that a write operation w immediately precedes a read operation r if w precedes
r, and there is no write operation w′ such that w precedes w′ and w′ precedes r. An atomic SWMR
register implementation is linearizable if and only if it satisfies the following two properties.
Property 1. If a read operation r returns the value v then:
• there is a write v operation that immediately precedes r or is concurrent with r, or
• no write operation precedes r and v = v0.
Property 2. If two read operations r and r′ return values vk and vk′ , respectively, and r precedes r′,
then k ≤ k′.
Henceforth, by “implementation of an atomic SWMR register”, we mean a linearizable implementation
of such a register, i.e., one that satisfies the above two properties.
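To make Properties 1 and 2 concrete, the following minimal sketch checks them on a finite history of completed operations. The `Op` record, the numeric timestamps, and the function names are illustrative assumptions and not part of the paper; each operation carries the index k of the value vk that it writes or returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str     # "write" or "read" (hypothetical encoding)
    k: int        # index of the value written (k >= 1) or returned (k >= 0)
    start: float  # invocation time
    end: float    # response time


def precedes(o, o2):
    """Definition 1: o precedes o2 if o's response occurs before o2's invocation."""
    return o.end < o2.start


def concurrent(o, o2):
    """Definition 1: o is concurrent with o2 if neither precedes the other."""
    return not precedes(o, o2) and not precedes(o2, o)


def satisfies_property_1(history):
    writes = [o for o in history if o.kind == "write"]
    for r in (o for o in history if o.kind == "read"):
        if r.k == 0:
            # Returning the initial value v0 is allowed only if no write precedes r.
            if any(precedes(w, r) for w in writes):
                return False
            continue
        w = next((w for w in writes if w.k == r.k), None)
        if w is None:
            return False  # the returned value was never written
        if concurrent(w, r):
            continue
        # Otherwise w must immediately precede r: no write lies between them.
        immediate = precedes(w, r) and not any(
            precedes(w, w2) and precedes(w2, r) for w2 in writes
        )
        if not immediate:
            return False
    return True


def satisfies_property_2(history):
    reads = [o for o in history if o.kind == "read"]
    return all(r.k <= r2.k for r in reads for r2 in reads if precedes(r, r2))


# A read concurrent with write(v1) may still return v0; a later read must return v1.
good = [Op("write", 1, 0, 2), Op("read", 0, 1, 1.5), Op("read", 1, 3, 4)]
# Here a read returns v0 after write(v1) and after another read already returned v1.
bad = [Op("write", 1, 0, 2), Op("read", 1, 1, 3), Op("read", 0, 4, 5)]
```

In the `bad` history the last read violates both properties: write(v1) precedes it, and an earlier read has already returned v1.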
2.3 General m&m systems
Let L = {S1, S2, . . . , Sm} be any bag of non-empty subsets of Π = {p1, p2, . . . , pn}.
Definition 2. ML is the class of m&m systems (induced by L), each consisting of:
1. The processes in Π.
2. Reliable asynchronous communication links between every pair of processes in Π.
3. The following set of registers: For each subset of processes Si in L, a non-empty set of atomic
registers that are shared by the processes in Si (and only by them).
Note that ML includes m&m systems that differ by the type and number of registers shared by the
processes in each Si; for example they could be sharing multi-writer multi-reader atomic registers.
Since we are interested in implementing atomic SWMR registers (shared by all processes in the
system), here we focus on an m&m system of ML in which the set of registers shared by the processes in
each set Si are atomic SWMR registers. More precisely, we focus on the m&m system SL defined below:
Definition 3. The general m&m system SL (induced by L) consists of:
1. The processes in Π.
2. Reliable asynchronous communication links between every pair of processes in Π.
3. The following set of registers: For each subset of processes Si in L and each process p ∈ Si, an
atomic SWMR register, denoted Ri[p], that can be written by p and read by all processes in Si (and
only by them).
In this paper, for every L, we give an algorithm that implements atomic SWMR registers shared
by all processes in the m&m system SL, and we show that this algorithm is optimal in the number of
process crashes that can be tolerated. In fact we prove that any algorithm that implements such registers
in any m&m system in ML (not only in SL) cannot tolerate more crashes. This justifies our focus on
SL: considering members of ML with more or stronger registers than SL does not improve the fault
tolerance of implementing atomic SWMR registers shared by all.
Without loss of generality we assume the following:
Assumption 4. The bag L = {S1, S2, . . . , Sm} of subsets of Π = {p1, p2, . . . , pn} is such that every
process in Π is in at least one of the subsets Sj of L.
2This can be ensured by the writer w writing values of the form 〈sn, v〉 where sn is the value of a counter that w
increments before each write.
Figure 1: A graph G (nodes p1, p2, p3, p4, p5)
Figure 2: The uniform m&m system SG induced by G (register boxes R1, . . . , R5, one next to each process)
This assumption can be made without loss of generality because it does not strengthen the system
SL induced by L. In fact, given a bag L that does not satisfy the above assumption, we can construct a
bag that satisfies the assumption as follows: for every process pi in Π that is not contained in any Sj of
L, we can add the singleton set {pi} to L. Let L′ be the resulting bag. By Definition 3(3) above, adding
{pi} to L results in adding only a local register to the induced system SL, namely, an atomic register
that pi (trivially) shares only with itself. So SL′ is just SL with some additional local registers. Note
that a pure message-passing system (with no shared memory) with n processes p1, p2, . . . , pn is modelled
by the system SL where L = {{p1}, {p2}, . . . , {pn}}.
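The construction of L′ described above is mechanical. A minimal sketch follows, assuming processes are numbered 1 to n and a bag is represented as a Python list of frozensets; the function name `complete_bag` is hypothetical.

```python
def complete_bag(n, bag):
    """Return L': the bag L plus a singleton {p} for every process p not covered by L."""
    covered = set().union(*bag)
    extra = [frozenset({p}) for p in range(1, n + 1) if p not in covered]
    return list(bag) + extra


# With no shared memory at all (empty bag), the completion is the bag of singletons
# {{p1}, {p2}, ..., {pn}}, i.e., the pure message-passing system.
PURE_MESSAGE_PASSING = complete_bag(3, [])
PARTLY_SHARED = complete_bag(3, [frozenset({1, 2})])
```

Each added singleton corresponds only to a local register, so the completed bag satisfies Assumption 4 without giving processes any new way to communicate.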
2.4 Uniform m&m systems
Let G = (V,E) be an undirected graph such that V = Π, i.e., the nodes of G are the n processes
p1, p2, . . . , pn of the system. For each pi ∈ V , let N(pi) = {pj | (pi, pj) ∈ E} be the neighbours of pi in
G, and let N+(pi) = N(pi) ∪ {pi}.
Definition 5. The uniform m&m system SG (induced by G) is the m&m system SL where L =
{S1, S2, . . . , Sn} with Si = N+(pi).3
The graph G induces the uniform m&m system SG where processes can communicate by message
passing (via reliable asynchronous communication links), and also by shared memory as follows: for each
process pi, and every neighbour p of pi in G (including pi) there is an atomic SWMR register Ri[p] that
can be written by p and read by every neighbour of pi in G (including pi). We can think of the registers
Ri[∗] as being physically located in the DRAM of the host of pi, and every neighbour of pi accessing
these registers over its RDMA connection to pi (which is modelled by an edge of G).4
For example, in Figures 1 and 2 we show a graph G and the uniform m&m system SG induced by
G. Here G has five nodes representing processes p1, p2, p3, p4, p5; the edges of G represent the RDMA
connections that allow these processes to share registers. The uniform m&m system SG induced by
G is the system SL for L = {S1, S2, S3, S4, S5} where each Si consists of pi and its neighbours in G:
specifically, S1 = {p1, p2}, S2 = {p1, p2, p3}, S3 = {p2, p3, p4, p5}, and S4 = S5 = {p3, p4, p5}. The box
adjacent to each process pi in SG represents the atomic SWMR registers that are shared among pi and
its neighbours in G (intuitively these registers are located at pi’s host). For example, in the box adjacent
to process p2, the component labelled p1 represents the register R2[p1] that can be written by p1 and
read by all the neighbours of p2 in G, namely p1, p2, and p3. Similarly, registers R2[p2] and R2[p3] can
be written by p2 and p3, respectively, and read by p1, p2, and p3. The dashed lines in Figure 2 represent
the asynchronous message-passing links between the processes of SG.
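Definition 5 can be sketched in a few lines. The edge list below is an assumption: it is inferred from the sets S1, . . . , S5 listed in the text, since the drawing of G is not reproduced here. Processes are numbered 1 to 5 and the helper name `uniform_bag` is hypothetical.

```python
def uniform_bag(n, edges):
    """Return [S1, ..., Sn] with Si = N+(pi), the closed neighbourhood of pi in G."""
    closed = {i: {i} for i in range(1, n + 1)}  # N+(pi) always contains pi itself
    for a, b in edges:
        closed[a].add(b)
        closed[b].add(a)
    return [frozenset(closed[i]) for i in range(1, n + 1)]


# Edges inferred from the sets S1..S5 given in the text (an assumption).
EDGES = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
BAG = uniform_bag(5, EDGES)

# One SWMR register Ri[p] per set Si and per process p in Si (Definition 3).
REGISTERS = [(i + 1, p) for i, S in enumerate(BAG) for p in sorted(S)]
```

For instance, `(2, 1)` in `REGISTERS` stands for R2[p1], the register located at p2's host that p1 writes and that p1, p2, and p3 read.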
3Note that L satisfies Assumption 4 because each Si = N+(pi) contains pi.
4As we mentioned in the introduction, we assume that the crash of pi does not prevent the neighbours of pi from
accessing the shared registers Ri[∗].
3 Atomic SWMR register implementation in general m&m systems
Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of Π = {p1, p2, . . . , pn}.
Recall that in system SL, for every Si in L, the processes in Si share some atomic SWMR registers that
can be read only by the processes in Si. In this section, we determine the maximum number of process
crashes tL that may occur in SL such that it is possible to implement in SL a shared atomic SWMR
register readable by all processes in SL. Intuitively, from the definition of tL: (a) if t ≤ tL processes
may crash, then any two subsets of processes of size n− t either intersect, or they each contain a process
that can communicate with the other via a shared SWMR register that it can write and the other can
read; and (b) if t > tL processes may crash, then there are two subsets of processes of size n− t that are
disjoint and cannot communicate via any shared SWMR register.
Definition 6. Given a bag L = {S1, . . . , Sm} of subsets of Π = {p1, p2, . . . , pn}, tL is the maximum
integer t such that the following condition holds: For all disjoint subsets P and P ′ of Π of size n − t
each, some set Si in L contains both a process in P and a process in P′.
Note that if t ≤ ⌊(n − 1)/2⌋ then there are no disjoint subsets P and P′ of Π of size n − t each, and so the
above condition is vacuously true. Therefore tL ≥ ⌊(n − 1)/2⌋. Recall that for a pure message-passing
system, L = {{p1}, {p2}, . . . , {pn}}, so in this system tL = ⌊(n − 1)/2⌋.
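The counting step behind the vacuous case can be spelled out as a short worked derivation (added here for illustration, using only the quantities n and t defined above):

```latex
\begin{align*}
P \cap P' = \emptyset,\quad |P| = |P'| = n - t
  &\;\Longrightarrow\; 2(n - t) \le n
   \;\Longleftrightarrow\; t \ge \lceil n/2 \rceil, \\
t \le \lfloor (n-1)/2 \rfloor < \lceil n/2 \rceil
  &\;\Longrightarrow\; \text{no such disjoint } P, P' \text{ exist.}
\end{align*}
```

In a pure message-passing system every Si is a singleton, so as soon as disjoint P and P′ exist (that is, for t ≥ ⌈n/2⌉) the condition fails, which gives tL = ⌊(n − 1)/2⌋ exactly.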
To illustrate Definition 6, suppose Π = {p1, p2, p3, p4, p5} and L = {S1, S2, S3} where S1 = {p1, p2},
S2 = {p4, p5}, and S3 = {p2, p3, p4}. By the definition of tL: (1) tL ≥ 3 because for any two disjoint
subsets of Π of size 5−3 = 2 each, there exists a set Si in L that intersects both subsets; e.g., for subsets
{p1, p5} and {p3, p4}, the set S2 = {p4, p5} intersects both of them. (2) tL < 4 because there are two
disjoint subsets {p1}, {p5} of size 5 − 4 = 1 each, such that no set Si in L contains both p1 and p5. So
in this example n = 5 and tL = 3 > ⌊(n − 1)/2⌋ = 2.
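Definition 6 can be evaluated directly by brute force on small systems. The sketch below is exponential in n and meant only as an illustration; processes are numbered 1 to n, and the names `condition_holds` and `t_L` are hypothetical.

```python
from itertools import combinations


def condition_holds(n, bag, t):
    """Definition 6: every two disjoint subsets of size n - t are both hit by some Si."""
    procs = range(1, n + 1)
    size = n - t
    for P in combinations(procs, size):
        rest = [p for p in procs if p not in P]
        for P2 in combinations(rest, size):
            if not any(S & set(P) and S & set(P2) for S in bag):
                return False  # P and P2 are disjoint and share no set Si
    return True


def t_L(n, bag):
    """Largest t in 0..n-1 for which the condition of Definition 6 holds."""
    return max(t for t in range(n) if condition_holds(n, bag, t))


# The example above: S1 = {p1, p2}, S2 = {p4, p5}, S3 = {p2, p3, p4}.
EXAMPLE = [frozenset({1, 2}), frozenset({4, 5}), frozenset({2, 3, 4})]
# Pure message passing on 5 processes: only singletons.
SINGLETONS = [frozenset({p}) for p in range(1, 6)]
```

Here `t_L(5, EXAMPLE)` evaluates to 3 and `t_L(5, SINGLETONS)` to 2, matching the values derived in the text.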
We now prove that in the general m&m system SL, it is possible to implement an atomic SWMR register
readable by all processes if and only if at most tL processes may crash in SL. More precisely:
Theorem 7. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of
Π = {p1, p2, . . . , pn}.
• If at most tL processes crash in SL, then for every process w in SL, it is possible to implement an
atomic SWMR register writable by w and readable by all processes in SL.
• If more than tL processes crash in SL, then for some process w in SL, it is impossible to implement
an atomic SWMR register writable by w and readable by all processes in SL.
The above theorem is a direct corollary of Theorem 17 (Section 3.1) and Theorem 18 (Section 3.2).
3.1 Algorithm
We now show how to implement an atomic SWMR register R, that can be written by an arbitrary fixed
process w and read by all processes, in an m&m system SL where at most tL processes may crash. This
implementation is given in terms of the procedures Write() and Read() shown in Algorithm 1.
Without loss of generality we assume that for all i ≥ 1, the i-th value that the writer writes is of the
form 〈i, val〉, and the initial value of the register R is 〈0, u0〉. To write 〈i, val〉 into R, the writer w calls
the procedure Write(〈i, val〉). To read R, a process q calls the procedure Read() that returns a value
of the form 〈i, val〉. The sequence number i makes the values written to R unique.
Algorithm 1 generalizes the well-known ABD implementation of an atomic SWMR register in pure
message-passing systems by Attiya, Bar-Noy and Dolev [5].5 To write a new value into R, the writer w
sends messages to all processes asking them to write the new value into all the shared SWMR registers
that they can write in SL. The writer then waits for acknowledgments from n− tL processes indicating
that they have done so. To read R, a process sends messages to all processes asking them for the most
up-to-date value that they can find in all the shared SWMR registers that they can read. The reader
waits for n− tL responses, selects the most up-to-date value among them, writes back that value (using
the same procedure that the writer uses), and returns it. From the definition of tL it follows that every
write of R “intersects” with every read of R at some shared SWMR register of SL. Note that since at
most tL processes crash, the waiting mentioned above does not block any process.
5The ABD algorithm is the special case of Algorithm 1 for SL where L = {{p1}, {p2}, . . . , {pn}}.
Algorithm 1 Implementation of an atomic SWMR register writeable by process w and readable by all
processes in SL, provided that at most tL processes crash.
Shared variables
    For all Si ∈ L and all p ∈ Si:
        Ri[p] : atomic SWMR register writeable by p and readable by every process in Si ∈ L;
                initialized to 〈0, u0〉.
Write(〈snw, u〉):        ▷ executed by the writer w
 1: send 〈W, 〈snw, u〉〉 to every process p in SL
 2: wait for 〈ACK-W, snw〉 messages from n − tL distinct processes
 3: return
                        ▷ executed by every process p in SL
 4: upon receipt of a 〈W, 〈snw, u〉〉 message from process w:
 5:     for every i in {1, . . . , m} such that p ∈ Si do
 6:         〈sn, −〉 ← Ri[p]
 7:         if snw > sn then
 8:             Ri[p] ← 〈snw, u〉
 9:     send 〈ACK-W, snw〉 to process w
Read():                 ▷ executed by any process q
10: snr ← snr + 1
11: send 〈R, snr〉 to every process p in SL
12: wait for 〈ACK-R, snr, 〈−, −〉〉 messages from n − tL distinct processes
13: 〈seq, val〉 ← max{〈r_sn, r_u〉 | received a 〈ACK-R, snr, 〈r_sn, r_u〉〉 message}
14: Write(〈seq, val〉)
15: return 〈seq, val〉
                        ▷ executed by every process p in SL
16: upon receipt of a 〈R, snr〉 message from a process q:
17:     〈r_sn, r_u〉 ← max{〈sn, u〉 | ∃i ∈ {1, . . . , m} : p ∈ Si and ∃p′ ∈ Si : Ri[p′] = 〈sn, u〉}
18:     send 〈ACK-R, snr, 〈r_sn, r_u〉〉 to process q
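The data flow of Algorithm 1 can be illustrated with a sequential toy model. This is a sketch under strong simplifying assumptions and not the authors' implementation: there are no real messages or concurrency, the caller chooses which processes respond before an operation returns (all others are treated as slow or crashed), and the write-back of a read uses the same responders as the read itself. The class and method names are hypothetical.

```python
class MMRegisterSim:
    """Sequential toy model of Algorithm 1 (no real messages, no concurrency)."""

    def __init__(self, n, bag, t, initial="u0"):
        self.n, self.t = n, t
        self.bag = [frozenset(S) for S in bag]
        # R[i][p] models Ri[p]: written by p, readable by every process in Si.
        self.R = [{p: (0, initial) for p in S} for S in self.bag]

    def _on_write_msg(self, p, value):
        # Lines 4-9: p updates each register it can write, guarded by the sequence number.
        for i, S in enumerate(self.bag):
            if p in S and value[0] > self.R[i][p][0]:
                self.R[i][p] = value

    def _on_read_msg(self, p):
        # Lines 16-18: p reports the most up-to-date tuple among the registers it can read.
        readable = [self.R[i][q] for i, S in enumerate(self.bag) if p in S for q in S]
        return max(readable, key=lambda v: v[0])

    def write(self, value, responders):
        # Lines 1-3: the call returns once n - t processes have acknowledged.
        assert len(responders) >= self.n - self.t
        for p in responders:
            self._on_write_msg(p, value)

    def read(self, responders):
        # Lines 10-15: collect n - t replies, take the maximum, write it back, return it.
        assert len(responders) >= self.n - self.t
        value = max((self._on_read_msg(p) for p in responders), key=lambda v: v[0])
        self.write(value, responders)
        return value


BAG = [{1, 2}, {4, 5}, {2, 3, 4}]       # example bag used to illustrate Definition 6 (tL = 3)
reg = MMRegisterSim(n=5, bag=BAG, t=3)
before = reg.read(responders={4, 5})    # no write yet: the initial value is returned
reg.write((1, "x"), responders={1, 2})  # only p1 and p2 acknowledge the write
after = reg.read(responders={3, 4})     # responders disjoint from {p1, p2}
```

Although the read's responders {p3, p4} are disjoint from the write's responders {p1, p2}, the read still returns 〈1, x〉, because p3 and p4 can read the register R3[p2] that p2 updated. This is the "intersection" at a shared register that the definition of tL guarantees.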
We now show that the procedure Write(), called by the writer w, and the procedure Read(), called
by any process q in SL, implement an atomic SWMR register R. To do so, we show that the calls of
Write() by w and of Read() by any process satisfy Properties 1 and 2 of atomic SWMR registers given
in Section 2.2.
Definition 8. The operation write(v) is the execution of Write(v) by the writer w for some tuple
v = 〈snw, u〉: this operation starts when w calls Write(v) and it completes if and when this call returns.
An operation read(v) is an execution of Read() that returns v to some process q: this operation starts
when q calls Read() and it completes when this call returns v to q.
Let v0 = 〈0, u0〉 be the initial value of the implemented register R, and, for k ≥ 1, let vk = 〈k,−〉
denote the k-th value written by the writer w on R. Note that all vk’s are distinct: for all i, j ≥ 0 with
i ≠ j, vi ≠ vj.
Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of Π = {p1, p2, . . . , pn}.
To prove the correctness of the SWMR implementation shown in Algorithm 1, we now consider an arbi-
trary execution of this implementation in SL.
Lemma 9. If at most tL processes crash, then any read(−) or write(−) operation executed by a process
that does not crash completes.
Proof. The only statements that could prevent the completion of a read(−) or write(−) operation are the
wait statements of line 2 and line 12. But since communication links are reliable, these wait statements
are for n− tL acknowledgements, and at most tL processes out of the n processes of SL may crash, it is
clear that these wait statements cannot block.
The proofs of the next two lemmas are straightforward and therefore omitted. The first one states
that every read operation returns some vk for k ≥ 0.
Lemma 10. If r is a read(v) operation in the execution, then v = vk for some k ≥ 0.
The next lemma says that no read operation can read a “future” value, i.e., a value that is written
after the read completes.
Lemma 11. If r is a read(v) operation in the execution, then either v = v0, or v = vk such that the
operation write(vk) precedes r or is concurrent with r.
Note that the guard in lines 7-8 (which is the only place where the shared SWMR registers are
updated) ensures that the content of each shared SWMR register in SL is non-decreasing in the following
sense:
Observation 12. [Register monotonicity] For all 1 ≤ i ≤ m and all p ∈ Si, if Ri[p] = 〈k,−〉 at some
time t and Ri[p] = 〈k′,−〉 at some time t′ ≥ t then k′ ≥ k.
Lemma 13. For all k ≥ 1, if a call to the procedure Write(vk) returns before a read(v) operation
starts, then v = vℓ for some ℓ ≥ k.
Proof. Suppose a call to Write(vk) returns before a read(v) operation starts; we must show that v = vℓ
with ` ≥ k. Note that before this call of Write(vk) returns, 〈ACK-W, k〉 messages are received from
a set P of n − tL distinct processes (see line 2 of the Write() procedure). From lines 5-8, which are
executed before these 〈ACK-W, k〉 messages are sent, and by Observation 12, it is clear that the following
holds:
Claim 13.1. By the time Write(vk) returns, every shared SWMR register in SL that can be written by
a process in P contains a tuple 〈k′,−〉 with k′ ≥ k.
Now consider the read(v) operation, say it is by process q. Recall that read(v) is an execution of the
Read() procedure that returns v to q. When q calls Read(), it increments a local counter snr and asks
every process p in SL to do the following: (a) read every SWMR register that p can read, and (b) reply
to q with a 〈ACK-R, snr, 〈r_sn, r_u〉〉 message such that 〈r_sn, r_u〉 is the tuple with the maximum r_sn
that p read. By line 12 of the Read() procedure, q waits to receive such 〈ACK-R, snr, 〈−, −〉〉 messages
from a set P′ of n − tL distinct processes, and q uses these messages to select the value v as follows:
v ← max{〈r_sn, r_u〉 | q received some 〈ACK-R, snr, 〈r_sn, r_u〉〉 from a process in P′}
Thus, by Lemma 10, it is clear that:
Claim 13.2. v = vℓ where ℓ = max{j | q received a 〈ACK-R, snr, 〈j, −〉〉 message from a process in P′}.
Claim 13.3. Some set Si in L contains both a process in P and a process in P′.
Proof of Claim 13.3. If P and P ′ are disjoint, the claim follows directly from Definition 6 of tL. If P
and P ′ intersect, let p be a process in both P and P ′. By Assumption 4, p is in some set Si in L, and
the claim follows.
By the above claim, some set Si in L contains a process p in P and a process p′ in P′. Since p ∈ Si and
p′ ∈ Si, Ri[p] is one of the SWMR registers that can be written by p and read by p′. From Claim 13.1,
by the time the call to Write(vk) returns, Ri[p] contains a tuple 〈k′, −〉 such that k′ ≥ k (*). Since
p′ ∈ P′, during the execution of read(v) by q, p′ reads all the shared SWMR registers that it can read,
including Ri[p]. Since read(v) starts after Write(vk) returns, p′ reads Ri[p] after Write(vk) returns.
Thus, by (*) and the monotonicity of Ri[p] (Observation 12), p′ reads from Ri[p] a tuple 〈r_sn, −〉 with
r_sn ≥ k′ ≥ k. Then p′ selects the tuple 〈j, −〉 with the maximum sn among all the 〈sn, −〉 tuples that
it read (see line 17); note that j ≥ k. So the 〈ACK-R, snr, 〈j, −〉〉 message that p′ sends to q, and q uses
to select v, is such that j ≥ k. So, by Claim 13.2, v = vℓ such that ℓ ≥ j ≥ k.
Lemma 13 immediately implies the following:
Corollary 14. For all k ≥ 1, if a write(vk) operation precedes a read(v) operation then v = vℓ with
ℓ ≥ k.
We now show that Algorithm 1 satisfies Properties 1 and 2 of atomic SWMR registers.
Lemma 15. The write(−) and read(−) operations satisfy Property 1.
Proof. Suppose for contradiction that Property 1 does not hold. Thus there is a read operation r =
read(v) such that:
(a) there is no write(v) operation that immediately precedes r or is concurrent with r, and
(b) some write(−) operation precedes r, or v ≠ v0.
There are two cases.
1. v = v0. By (b) above, some write(−) operation, say write(vk), precedes r. Thus write(vk) precedes
read(v0). Since k ≥ 1 this contradicts Corollary 14.
2. v ≠ v0. By Lemma 11, v = vk such that the operation write(vk) precedes r, or write(vk) is
concurrent with r. By (a) above, write(vk) does not immediately precede r, and write(vk) is not
concurrent with r. Thus, write(vk) precedes r, but not immediately. Let write(vk′) be the write
operation that immediately precedes r. Note that write(vk) precedes write(vk′), so k < k′. Since
write(vk′) precedes r = read(v), by Corollary 14, v = vℓ with ℓ ≥ k′, so ℓ > k. This contradicts
that v = vk.
Since both cases lead to a contradiction, Property 1 holds.
Lemma 16. The write(−) and read(−) operations satisfy Property 2.
Proof. We have to show that if a read(vk) operation precedes a read(vk′) operation then k ≤ k′. Suppose
read(vk) precedes read(vk′). Note that during the read(vk) operation, namely in line 14, there is a call
to the procedure Write(vk) which returns before the read(vk) operation completes. So this call to
Write(vk) returns before the read(vk′) operation starts. By Lemma 13, k ≤ k′.
Lemmas 9, 15 and 16 immediately imply:
Theorem 17. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of
Π = {p1, p2, . . . , pn}. If at most tL processes crash in SL, for every process w in SL, Algorithm 1
implements an atomic SWMR register writable by w and readable by all processes in SL.
3.2 Lower bound
Theorem 18. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of
Π = {p1, p2, . . . , pn}. If more than tL processes crash in SL, for some process w in SL, there is no
algorithm that implements an atomic SWMR register writable by w and readable by all processes in SL.
Proof. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of Π =
{p1, p2, . . . , pn}. Suppose for contradiction that t > tL processes can crash in SL, but for every process w
in SL, there is an algorithm Aw that implements an atomic SWMR register writable by w and readable
by all processes in SL (*).
Figure 3: Partition of Π into P, P′ (of size n − t each) and Q = Π − P − P′. Since t > tL, the sets
P and P′ are disjoint and no Si contains both a process in P and a process in P′.
Since t > tL, by Definition 6 of tL there are two disjoint subsets P and P′ of Π, of size n − t
each, such that no set Si in L contains both a process in P and a process in P′ (**). Since P and P′
are disjoint, the sets P, P′, and Q = Π − (P ∪ P′) form a partition of Π (see Figure 3).
Let the writer w be any process in P . Let A be an algorithm that tolerates t > tL process crashes in
SL and implements an atomic SWMR register R that is writable by w and readable by all processes in
SL; this algorithm exists by our initial assumption (*).
Since |P ∪Q ∪ P ′| = n, clearly |P ∪Q| = |P ′ ∪Q| = n − (n − t) = t. Since algorithm A tolerates t
crashes, it works correctly in every execution in which all the processes in P ∪Q or in P ′ ∪Q crash.
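For small systems, a partition of this kind can be found by exhaustive search. The sketch below is only an illustration of the partition used in the proof; processes are numbered 1 to n and the function name `find_partition` is hypothetical.

```python
from itertools import combinations


def find_partition(n, bag, t):
    """Return (P, P2, Q) with |P| = |P2| = n - t, P and P2 disjoint, and no Si
    containing both a process of P and a process of P2; return None if none exists."""
    procs = range(1, n + 1)
    size = n - t
    for P in combinations(procs, size):
        rest = [p for p in procs if p not in P]
        for P2 in combinations(rest, size):
            if not any(S & set(P) and S & set(P2) for S in bag):
                Q = set(procs) - set(P) - set(P2)
                return set(P), set(P2), Q
    return None


# The bag used to illustrate Definition 6, for which tL = 3.
EXAMPLE = [frozenset({1, 2}), frozenset({4, 5}), frozenset({2, 3, 4})]
witness = find_partition(5, EXAMPLE, 4)      # t = 4 > tL: a partition exists
no_witness = find_partition(5, EXAMPLE, 3)   # t = 3 = tL: none exists
```

For any returned triple, both P ∪ Q and P′ ∪ Q have exactly t processes, so an algorithm that tolerates t crashes must make progress when either of these sets crashes entirely.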
We now define three executions E1, E2, and E3 of algorithm A. These are illustrated in Figure 4.
Figure 4: Scenarios for Theorem 18 (timelines of executions E1, E2, and E3, showing the write(v)
operation by w, the read operation by r, and the times t_w^s, t_w^e, t_r^s, and t_r^e)
Execution E1 of algorithm A is defined as follows:
• The processes in P′ ∪ Q crash from the beginning of the execution; they take no steps in E1.
• At some time t_w^s the writer w starts an operation to write the value v into the implemented register
R, for some v ≠ v0, where v0 is the initial value of R. Since the number of processes that crash
in E1 is |P′ ∪ Q| = t, and the algorithm A tolerates t crashes, this write operation eventually
terminates, say at time t_w^e.
• After this write terminates, no process takes a step up to and including some time t_r^s > t_w^e.
Note that in E1, processes in P are the only ones that take steps up to time t_r^s.
Execution E2 of algorithm A is defined as follows:
• The processes in P ∪ Q crash from the beginning of the execution; they take no steps in E2.
• At time t_r^s, some process r ∈ P′ starts a read operation on the implemented register R. Since the
number of processes that crash in E2 is |P ∪ Q| = t, and the algorithm A tolerates t crashes, this
read operation terminates, say at time t_r^e.
Since no write operation precedes the read operation in E2, Property 1 of atomic SWMR registers
implies:
Claim 18.1. At time t_r^e in E2 the read operation returns the initial value v0 of R.
We now construct an execution E3 of the algorithm A that merges E1 and E2, and contradicts the
atomicity of the implemented R. E3 is identical to E1 up to time t_r^s, and it is identical to E2 from time
t_r^s to t_r^e (note that in E3 processes in Q can only take steps after time t_r^e). To obtain this merged run
E3, intuitively we delay the messages sent by processes in P to processes in P′ to after time t_r^e, and we
also use the fact that processes in P′ cannot read any of the shared registers in SL that processes in P
may have written by time t_r^s (this is because of (**)).
Claim 18.2. There is an execution E3 of algorithm A such that
(a) up to and including time t_w^e, E3 is indistinguishable from E1 to all processes.
(b) up to and including time t_r^e, E3 is indistinguishable from E2 to all processes in P′.
(c) No process crashes in E3.
Proof of Claim 18.2. Until time $t^s_r$, $E_3$ is identical to $E_1$. We now show that it is possible to
extend $E_3$ in the time interval $[t^s_r, t^e_r]$ with the sequence of steps that the processes in $P'$
executed during the same time interval in $E_2$.⁶ More precisely, let $s_1, s_2, \ldots, s_\ell$ be the
sequence of steps executed during the time interval $[t^s_r, t^e_r]$ in $E_2$. Since only processes in $P'$
take steps in $E_2$, $s_1, s_2, \ldots, s_\ell$ are all steps of processes in $P'$. Let $C^0_2$ be the
configuration of the system $S_L$ at time $t^s_r$ in $E_2$,⁷ and let $C^i_2$ be the configuration that
results from applying step $s_i$ to configuration $C^{i-1}_2$, for all $i$ such that $1 \le i \le \ell$. We
will prove that there are configurations $C^0_3, C^1_3, \ldots, C^\ell_3$ of $S_L$ extending $E_3$ at time
$t^s_r$ such that:
(i) every process in $P'$ has the same state in $C^i_3$ as in $C^i_2$;
(ii) the set of messages sent by processes in $P'$ to processes in $P'$, but not yet received, is the same
in $C^i_3$ as in $C^i_2$;
(iii) every shared register readable by processes in $P'$ has the same value in $C^i_3$ as in $C^i_2$; and
(iv) if $i \neq 0$, $C^i_3$ is the result of applying step $s_i$ to configuration $C^{i-1}_3$.
This is shown by induction on $i$.
For the basis of the induction, $i = 0$, we take $C^0_3$ to be the configuration of the system just before
time $t^s_r$ in $E_3$. Since no process in $P'$ takes a step before time $t^s_r$ in either $E_2$ or $E_3$,
$C^0_3$ satisfies properties (i) and (ii).
Claim 18.3. At time $t^s_r$ in $E_3$ the shared registers that can be read by processes in $P'$ have their
initial values.
Proof of Claim 18.3. Suppose, for contradiction, that at time $t^s_r$ in $E_3$, some shared register $R$
that can be read by a process $p'$ in $P'$ does not have its initial value. By construction, $E_3$ is
identical to $E_1$ until time $t^s_r$, and so only processes in $P$ take steps before time $t^s_r$ in $E_3$.
Thus, register $R$ was written by some process $p$ in $P$ by time $t^s_r$ in $E_3$. Since $R$ is readable by
$p' \in P'$ and is written by $p \in P$, $R$ is shared by both $p$ and $p'$. Thus, there must be a set $S_i$
in $L$ that contains both $p$ and $p'$, a contradiction to (**).
By Claim 18.3, the shared registers readable by processes in $P'$ have the same value (namely, their
initial value) in $C^0_3$ as in $C^0_2$. So, $C^0_3$ also satisfies property (iii). Property (iv) is
vacuously true for $i = 0$.
For the induction step, for each $i$ such that $1 \le i \le \ell$, we consider separately the cases of $s_i$
being a step to send a message, receive a message, write a shared register, and read a shared register. In
each case, it is easy to verify that, assuming (inductively) that $C^{i-1}_3$ has properties (i)–(iv), step
$s_i$ is applicable to $C^{i-1}_3$, and the resulting configuration $C^i_3$ has properties (i)–(iv).
To complete the definition of $E_3$, after time $t^e_r$ we let processes take steps in round-robin fashion.
Whenever a process’s step is to receive a message, it receives the oldest one sent to it; this ensures
that all messages are eventually received. Processes continue taking steps in this fashion according to
algorithm $A$.
Since $E_3$ is identical to $E_1$ up to and including time $t^e_w$, $E_3$ is indistinguishable from $E_1$ up
to and including time $t^e_w$ to all processes in $P$. This proves part (a) of the claim.
Note that in $E_3$ and $E_2$, the processes in $P'$: (a) take no steps before time $t^s_r$, and (b) during
the time interval $[t^s_r, t^e_r]$, they execute exactly the same sequence of steps, and go through the same
sequence of states. Thus, up to and including time $t^e_r$, $E_3$ is indistinguishable from $E_2$ to all
processes in $P'$. This proves part (b) of the claim.
Finally, every process takes steps as required by the algorithm in $E_3$, so no process crashes. This
proves part (c) of the claim.
⁶ A step of $A$ executed by process $p$ is one of the following: $p$ sending or receiving a message, or $p$
applying a write or a read operation to a shared register in $S_L$.
⁷ The configuration of $S_L$ at time $t$ in execution $E$ consists of the state of each process, the set of
messages sent but not yet received, and the value of each shared register in $S_L$ at time $t$ in $E$.
Figure 5: A graph $G$
Figure 6: The square $G^2$ of graph $G$
By Claim 18.2(a), up to and including time $t^e_w$, $E_3$ is indistinguishable from $E_1$ to the writer
$w \in P$. So $E_3$ contains the write operation that writes $v \neq v_0$ into $R$, which starts at time
$t^s_w$ and ends at time $t^e_w$. By Claim 18.2(b), up to and including time $t^e_r$, $E_3$ is
indistinguishable from $E_2$ to $r \in P'$. So $E_3$ contains the read operation that returns $v_0$, which
starts at time $t^s_r$ and ends at time $t^e_r$. Since $t^e_w < t^s_r$, this read operation violates
Property 1 of atomic SWMR registers. As there are no process crashes in $E_3$ (by Claim 18.2(c)), this
contradicts the assumption that $A$ is an implementation of an atomic SWMR register $R$ that tolerates
$t > t_L$ crashes.
Note that the proof of Theorem 18 does not depend on the type or number of registers shared by
the processes in each set $S_i$ of the bag $L$. So the result of this theorem applies not only to $S_L$ but
also to every m&m system in $M_L$. In fact, the proof of Theorem 18 does not even depend on the type of
objects that are shared by the processes in each set $S_i$; for example these objects could include queues,
stacks, and consensus objects. Hence we have the following stronger result:
Theorem 19. Consider any m&m system $S$ induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of
$\Pi = \{p_1, p_2, \ldots, p_n\}$, where the processes in each $S_i$ share any number of arbitrary objects
among themselves (and only among themselves). If more than $t_L$ processes crash in $S$, then for some
process $w$ in $S$, there is no algorithm that implements an atomic SWMR register writable by $w$ and
readable by all processes in $S$.
4 Atomic SWMR register implementation in uniform m&m systems
Let $G = (V, E)$ be an undirected graph where $V = \{p_1, p_2, \ldots, p_n\}$, i.e., the nodes of $G$ are
the processes $p_1, p_2, \ldots, p_n$. Let $S_G$ be the uniform m&m system induced by $G$. Recall that in
$S_G$, each process $p_i$ and its neighbours in $G$ share some atomic SWMR registers that can be read by
(and only by) them.
We now use $G$ to determine the maximum number of process crashes that may occur in $S_G$ such that
it is possible to implement a shared atomic SWMR register readable by all processes in $S_G$. To do so,
we first recall the definition of the square of the graph $G$: $G^2 = (V, E')$ where
$E' = \{(u, v) \mid (u, v) \in E \text{ or } \exists k \in V \text{ such that } (u, k) \in E \text{ and } (k, v) \in E\}$.
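The definition of the square can be made concrete with a minimal sketch. The function name `square`, the edge-list representation, and the four-node path used as input are illustrative assumptions, not taken from the paper; self-loops are ignored because only edges between distinct nodes matter for disjoint sets.

```python
from itertools import combinations

def square(nodes, edges):
    """Return the edges of G^2 between distinct nodes, as frozensets {u, v}."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sq = set()
    for u, v in combinations(nodes, 2):
        # (u, v) is in E' if u and v are adjacent in G or share a neighbour k.
        if v in adj[u] or adj[u] & adj[v]:
            sq.add(frozenset((u, v)))
    return sq

# Hypothetical example graph: the path p1 - p2 - p3 - p4.
nodes = ["p1", "p2", "p3", "p4"]
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
sq = square(nodes, edges)
```

On this path, the square adds the edges (p1, p3) and (p2, p4) to the three original edges, while p1 and p4 stay unconnected because they are at distance 3.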
Definition 20. Given an undirected graph $G = (V, E)$ such that $V = \{p_1, p_2, \ldots, p_n\}$, $t_G$ is
the maximum integer $t$ such that the following condition holds: For all disjoint subsets $P$ and $P'$ of
$V$ of size $n - t$ each, some edge in $G^2$ connects a node in $P$ with a node in $P'$; i.e., $G^2$ has an
edge $(u, v)$ such that $u \in P$ and $v \in P'$.
Note that for every undirected graph $G$, $t_G \ge \lfloor (n-1)/2 \rfloor$. Moreover, in a pure
message-passing system (where $G$ and $G^2$ have no edges) $t_G = \lfloor (n-1)/2 \rfloor$.
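For small graphs, Definition 20 can be evaluated directly. The following is a brute-force sketch under stated assumptions: nodes are numbered 0..n-1, the function name `t_G` is ours, and the running time is exponential in n. It uses the observation that a bad pair (P, P′) exists for a given P exactly when at least n − t nodes outside P have no G²-edge to P.

```python
from itertools import combinations

def t_G(n, edges):
    """Brute-force t_G for a graph on nodes 0..n-1 (small n only)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # near[u] = neighbours of u in G^2 (nodes at distance 1 or 2 in G).
    near = [set(adj[u]) for u in range(n)]
    for u in range(n):
        for k in adj[u]:
            near[u] |= adj[k]
        near[u].discard(u)

    def condition_holds(t):
        m = n - t
        for P in combinations(range(n), m):
            reach = set().union(*(near[u] for u in P))
            free = [v for v in range(n) if v not in P and v not in reach]
            if len(free) >= m:
                return False  # any m nodes of `free` form a bad P'
        return True

    # The condition fails for t = n (both sets empty), so t_G <= n - 1.
    return max(t for t in range(n) if condition_holds(t))
```

For a graph with no edges on 5 nodes the function evaluates to 2, matching $\lfloor (n-1)/2 \rfloor$ for a pure message-passing system.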
To illustrate the above definition of $t_G$, consider the graph $G$ in Figure 5 where
$V = \{p_1, p_2, p_3, p_4, p_5\}$. Figure 6 shows the corresponding $G^2$ graph. By the above definition of
$t_G$: (a) $t_G \ge 3$ because for any two disjoint subsets of $V$ of size $5 - 3 = 2$ each, $G^2$ has an
edge that “connects” these two subsets (e.g., for subsets $P = \{p_1, p_2\}$ and $P' = \{p_4, p_5\}$, the
edge $(p_2, p_4)$ of $G^2$ connects a node of $P$ to a node of $P'$), and (b) $t_G < 4$ because there are two
disjoint subsets $\{p_1\}$, $\{p_5\}$ of size $5 - 4 = 1$ each, such that no edge in $G^2$ connects $p_1$ and
$p_5$. So in this example $n = 5$ and $t_G = 3$.
In Theorem 22 given below, we show that in the uniform m&m system $S_G$ induced by a graph $G$, it
is possible to implement an atomic SWMR register readable by all processes if and only if at most $t_G$
processes may crash in $S_G$.
For example, consider the uniform m&m system $S_G$ of 5 processes induced by the graph $G$ in Figure 5.
In addition to message-passing links, $S_G$ has 3 pairwise RDMA connections. Since $t_G = 3$, by
Theorem 22, we can implement an atomic SWMR register readable by all 5 processes of $S_G$ if and only
if at most 3 of them may crash. In contrast, in a pure message-passing system with 5 processes, no
implementation of such a register can tolerate more than 2 process crashes.
To prove Theorem 22 we first show:
Lemma 21. Let $S_G$ be the uniform m&m system induced by an undirected graph
$G = (V, E)$ where $V = \{p_1, p_2, \ldots, p_n\}$. Let $S_L$ be the general m&m system such that
$S_L = S_G$. Then $t_G = t_L$.
Proof. By Definition 5, $S_L$ is the general m&m system where $L = \{S_1, S_2, \ldots, S_n\}$ such that
$S_i = N^+(p_i)$, i.e., for all $i$, $1 \le i \le n$, $S_i$ is the set of neighbours of $p_i$ (including
$p_i$) in the graph $G$. Recall that $t_L$ is the maximum $t$ such that for all disjoint subsets $P$ and $P'$
of $V$ of size $n - t$ each, some set $S_i$ in $L$ contains both a node in $P$ and a node in $P'$.
From the definitions of $t_G$ (Definition 20) and $t_L$, it is clear that to prove that $t_G = t_L$ it
suffices to show that for all disjoint subsets $P$ and $P'$ of $V$ of size $n - t$ each, the following holds:
some edge in $G^2$ connects a node in $P$ with a node in $P'$ if and only if some set $S_i$ in $L$ contains
both a node in $P$ and a node in $P'$.
[Only If] Suppose $G^2$ has an edge $(p_i, p_j)$ such that $p_i \in P$ and $p_j \in P'$; since $P$ and $P'$
are disjoint, $p_i$ and $p_j$ are distinct. By definition of $G^2$, there are two cases:
1. $(p_i, p_j) \in E$. In this case, $p_j \in N^+(p_i)$ and $p_i \in N^+(p_i)$. So the set
$S_i = N^+(p_i)$ in $L$ contains both node $p_i \in P$ and node $p_j \in P'$.
2. There is a node $p_k \in V$ such that $(p_i, p_k) \in E$ and $(p_k, p_j) \in E$. In this case,
$p_i \in N^+(p_k)$ and $p_j \in N^+(p_k)$. So the set $S_k = N^+(p_k)$ in $L$ contains both $p_i \in P$ and
$p_j \in P'$.
So in both cases, some set $S_\ell$ in $L$ contains both a node in $P$ and a node in $P'$.
[If] Suppose set $S_k$ in $L$ contains both a node $p_i$ in $P$ and a node $p_j$ in $P'$; since $P$ and $P'$
are disjoint, $p_i$ and $p_j$ are distinct. Recall that $S_k = N^+(p_k)$ for node $p_k \in V$.
There are two cases:
1. $p_i$, $p_j$ and $p_k$ are pairwise distinct. In this case, since $p_i$ and $p_j$ are in
$S_k = N^+(p_k)$, $(p_i, p_k)$ and $(p_k, p_j)$ are edges of $G$, i.e., $(p_i, p_k) \in E$ and
$(p_k, p_j) \in E$. Thus, by definition of $G^2$, $(p_i, p_j)$ is an edge of $G^2$; this edge connects
$p_i \in P$ and $p_j \in P'$.
2. $p_k = p_i$ or $p_k = p_j$. Without loss of generality, assume that $p_k = p_i$. Since $p_i$ and $p_j$
are in $N^+(p_k) = N^+(p_i)$, $(p_i, p_j)$ must be an edge of $G$, i.e., $(p_i, p_j) \in E$. Thus, by
definition of $G^2$, $(p_i, p_j)$ is an edge of $G^2$; this edge connects $p_i \in P$ and $p_j \in P'$.
So in both cases, some edge in $G^2$ connects a node in $P$ with a node in $P'$.
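The pairwise equivalence at the heart of this proof can be sanity-checked on small random graphs. The sketch below is only an empirical check, not a substitute for the proof; the function names, the node numbering 0..n-1, and the random-graph parameters are illustrative assumptions.

```python
from itertools import combinations
import random

def closed_neighbourhoods(n, edges):
    """S_i = N+(p_i): node i together with its neighbours in G."""
    S = [{i} for i in range(n)]
    for u, v in edges:
        S[u].add(v)
        S[v].add(u)
    return S

def lemma21_equivalence_holds(n, edges):
    S = closed_neighbourhoods(n, edges)
    E = {frozenset(e) for e in edges}
    for u, v in combinations(range(n), 2):
        # Edge of G^2: u, v adjacent in G or joined by a two-edge path.
        in_square = frozenset((u, v)) in E or any(
            frozenset((u, k)) in E and frozenset((k, v)) in E
            for k in range(n))
        # Some set S_k of the bag L contains both u and v.
        in_some_set = any(u in Sk and v in Sk for Sk in S)
        if in_square != in_some_set:
            return False
    return True

rng = random.Random(0)  # fixed seed so the check is reproducible
for _ in range(200):
    n = rng.randint(2, 8)
    edges = [e for e in combinations(range(n), 2) if rng.random() < 0.3]
    assert lemma21_equivalence_holds(n, edges)
```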
From Lemma 21 and Theorem 7, we have:
Theorem 22. Let $S_G$ be the uniform m&m system induced by an undirected graph
$G = (V, E)$ where $V = \{p_1, p_2, \ldots, p_n\}$.
• If at most $t_G$ processes crash in $S_G$, then for every process $w$ in $S_G$, it is possible to
implement an atomic SWMR register writable by $w$ and readable by all processes in $S_G$.
• If more than $t_G$ processes crash in $S_G$, then for some process $w$ in $S_G$, it is impossible to
implement an atomic SWMR register writable by $w$ and readable by all processes in $S_G$.
Figure 7: The Hoffman-Singleton graph
Figure 8: The Petersen graph
To illustrate this theorem, we now give three examples. For our first example, consider a pure
message-passing system $S$ with 50 nodes. In $S$, one can implement an atomic SWMR register $R$ (readable by
all the processes) only if at most 24 process crashes may occur. But if we allow each process of $S$
to establish 7 pairwise RDMA connections, one can implement $R$ in a way that tolerates any number
of process crashes (i.e., $R$ is wait-free). This is because there is an undirected graph $G$ with $n = 50$
nodes, each with degree 7, such that $G^2$ has an edge between every pair of nodes ($G$ is the well-known
Hoffman-Singleton graph [16] shown in Figure 7 [32]); so $t_G = n - 1 = 49$, and thus by Theorem 22
it is possible to implement $R$ in the uniform m&m system $S_G$ in a way that tolerates up to 49 process
crashes. Some simple graph theory arguments show that this is optimal in two ways: (a) one cannot
implement a wait-free register $R$ that is shared by 50 processes with fewer than 7 RDMA connections per
process (more precisely, with any such implementation, if a process has fewer than 7 RDMA connections
there must be another process with more than 7 RDMA connections), and (b) with at most 7 RDMA
connections per process, one cannot implement a wait-free register $R$ that is shared by more than 50
processes.
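One way to see the counting behind (b), offered here as a sketch rather than as the authors' own argument: a wait-free implementation needs $t_G = n - 1$, i.e., every pair of distinct nodes must be adjacent in $G^2$, so every node must reach all others within two hops in $G$. If every node has degree at most $d$ (the symbol $d$ is introduced here for illustration), a fixed node reaches at most $d$ nodes in one hop and at most $d(d-1)$ further nodes in two hops, which gives

$$
n \;\le\; 1 + d + d(d-1) \;=\; d^2 + 1, \qquad\text{so } d = 7 \text{ gives } n \le 50 \text{ and } d = 6 \text{ gives } n \le 37 .
$$

In particular, with 50 processes at least one process needs 7 or more RDMA connections.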
As another example, consider the uniform m&m system $S_G$ with $n = 10$ processes and 3 RDMA
connections per process induced by the Petersen graph $G$ shown in Figure 8.⁸ Since $G$ has diameter 2,
$G^2$ has an edge between every pair of nodes, and so $t_G = n - 1 = 9$. Thus, by Theorem 22, one can
implement an atomic SWMR register $R$ in $S_G$ in a way that tolerates up to 9 process crashes. In contrast,
in a pure message-passing system with 10 processes, no implementation of $R$ can tolerate more than 4
process crashes.
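The claim about the Petersen graph is easy to check mechanically. The sketch below builds the graph with one common labelling (outer 5-cycle 0..4, inner pentagram 5..9, five spokes; the labelling is our assumption, not taken from Figure 8) and verifies that every pair of distinct nodes is adjacent or has a common neighbour.

```python
from itertools import combinations

# Petersen graph: outer 5-cycle 0..4, inner pentagram 5..9, and five spokes.
edges = set()
for i in range(5):
    edges.add(frozenset((i, (i + 1) % 5)))          # outer cycle
    edges.add(frozenset((5 + i, 5 + (i + 2) % 5)))  # inner pentagram
    edges.add(frozenset((i, 5 + i)))                # spoke

adj = {u: set() for u in range(10)}
for e in edges:
    u, v = tuple(e)
    adj[u].add(v)
    adj[v].add(u)

# G^2 is complete iff every pair of distinct nodes is adjacent in G
# or shares at least one neighbour in G.
square_is_complete = all(
    v in adj[u] or adj[u] & adj[v] for u, v in combinations(range(10), 2))
degrees = sorted(len(adj[u]) for u in range(10))
```

Here every node has degree 3 and `square_is_complete` evaluates to true, which is exactly the condition $t_G = n - 1 = 9$.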
As our last example, we show that expander graphs with high vertex expansion ratio [17] induce
uniform m&m systems that support highly fault-tolerant register implementations. To do so, first recall
the definition of the vertex expansion ratio:
Definition 23. Let $G = (V, E)$ be any undirected graph.
1. The vertex boundary of a subset $S \subseteq V$ is
$\delta S = \{v \in V : (u, v) \in E,\ u \in S,\ v \notin S\}$
2. The vertex expansion ratio of $G$, denoted $h(G)$, is defined as
$h(G) = \min_{S \subseteq V : 0 < |S| \le |V|/2} \frac{|\delta S|}{|S|}$
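For intuition, Definition 23 can be evaluated by exhaustive search on small graphs. This is a sketch under stated assumptions: nodes are numbered 0..n-1 with n ≥ 2, the function name `vertex_expansion` is ours, and the search is exponential in n. Exact fractions avoid rounding issues in the minimum.

```python
from fractions import Fraction
from itertools import combinations

def vertex_expansion(n, edges):
    """Brute-force h(G) for a graph on nodes 0..n-1 (small n >= 2 only)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = None
    for size in range(1, n // 2 + 1):              # 0 < |S| <= |V|/2
        for S in combinations(range(n), size):
            # Vertex boundary: neighbours of S that lie outside S.
            boundary = set().union(*(adj[u] for u in S)) - set(S)
            ratio = Fraction(len(boundary), size)
            if best is None or ratio < best:
                best = ratio
    return best
```

For the 4-cycle this gives 1, and for the path on 4 nodes it gives 1/2 (take $S$ to be an end node and its neighbour).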
We now prove that any graph $G$ with high vertex expansion ratio $h$ also has a large $t_G$.
Theorem 24. For any undirected graph $G$ with $n$ nodes and vertex expansion ratio $h$,
$t_G \ge \lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$.
Proof. Let $G = (V, E)$ be an undirected graph with $n$ nodes and vertex expansion ratio $h$. To show
$t_G \ge \lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$, we must show that for every $t$,
$0 \le t \le \lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$, the following holds. For all disjoint subsets
$P$ and $P'$ of $V$ of size $n - t$ each: (*) some edge in $G^2$ connects a node in $P$ to a node in $P'$.
⁸ Like the Hoffman-Singleton graph, the Petersen graph is a Moore graph with diameter 2 [16].
Let $t$ be such that $0 \le t \le \lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$. Clearly,
$0 \le t < (1 - \frac{1}{h^2+2h+2}) n$. Let $P$ and $P'$ be any two disjoint subsets of $V$ of size
$m = n - t$ each. We now show that (*) holds.
There are three cases: (1) $|P \cup \delta P| \le n/2$, (2) $|P' \cup \delta P'| \le n/2$, or
(3) $|P \cup \delta P| > n/2$ and $|P' \cup \delta P'| > n/2$.
Case 1: $|P \cup \delta P| \le n/2$. Since $|P| = m \le |P \cup \delta P| \le n/2$, by the definition of
vertex expansion ratio $h$, $h \le |\delta P|/|P|$. Since $|P| = m$, we have $(h+1)m \le |P \cup \delta P|$.
Thus, again by the definition of vertex expansion ratio $h$,
$(h+1)^2 m \le |P \cup \delta P \cup \delta(P \cup \delta P)|$.
By assumption:
$$
\begin{aligned}
& t < \left(1 - \frac{1}{h^2+2h+2}\right) n \\
\Rightarrow\ & \frac{n}{h^2+2h+2} < n - t \\
\Rightarrow\ & \frac{n}{h^2+2h+2} < m \\
\Rightarrow\ & n < (h^2+2h+2)\, m \\
\Rightarrow\ & n < (h+1)^2 m + m
\end{aligned}
$$
Since $|P'| = m$, $|P \cup \delta P \cup \delta(P \cup \delta P)| \ge (h+1)^2 m$, and $(h+1)^2 m + m > n$, the
sets $P'$ and $P \cup \delta P \cup \delta(P \cup \delta P)$ intersect. Thus, since $P'$ and $P$ are disjoint,
there is a node $q$ in $P'$ such that: either (i) $q$ is in $\delta P$, so it is connected to a node in $P$
by an edge in $G$, or (ii) $q$ is in $\delta(P \cup \delta P)$, so it is connected to a node in $P$ by a
two-edge path in $G$. Thus, by the definition of $G^2$, (*) holds.
Case 2: $|P' \cup \delta P'| \le n/2$. By a symmetric argument to Case 1, (*) holds.
Case 3: $|P \cup \delta P| > n/2$ and $|P' \cup \delta P'| > n/2$. Then the sets $P \cup \delta P$ and
$P' \cup \delta P'$ intersect. Thus, since $P$ and $P'$ are disjoint, there are three cases: (1) $P$ and
$\delta P'$ intersect, so an edge in $G$ connects a node in $P$ to a node in $P'$, or (2) $P'$ and $\delta P$
intersect, so an edge in $G$ connects a node in $P'$ to a node in $P$, or (3) $\delta P'$ and $\delta P$
intersect, so there are nodes $p \in P$ and $p' \in P'$ that are connected by a two-edge path in $G$. Thus,
in all cases, by the definition of $G^2$, (*) holds.
Therefore, in all cases, (*) holds.
By Theorems 22 and 24, we have:
Corollary 25. Let $G$ be any undirected graph with $n$ nodes and vertex expansion ratio $h$. If at most
$\lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$ processes crash in $S_G$, then for every process $w$ in $S_G$,
it is possible to implement an atomic SWMR register writable by $w$ and readable by all processes in $S_G$.
5 Optimal randomized consensus in m&m systems
In the consensus problem, each process proposes some value and must decide a value such that the
following properties hold:
• Validity: Each decision value is one of the proposal values.
• Agreement: No two processes decide different values.
• Termination: Every process that does not crash eventually decides a value.
This problem cannot be solved in asynchronous distributed systems either with message-passing [10],
or with shared registers [26], but there are randomized algorithms that solve randomized consensus,
a weaker version of the consensus problem that requires Termination “only” with probability 1. In
particular, in shared-memory systems, it is known that randomized consensus can be solved for any
number of process crashes, but in message-passing systems, it can be solved if and only if fewer than
half of the processes may crash.
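The two safety properties above can be checked on the outcome of a finished run. A minimal sketch follows; the function name and the dictionary representation of proposals and decisions are hypothetical. Termination is deliberately not checked, since it is a liveness property that cannot be decided from a finite record of decisions.

```python
def check_consensus_run(proposals, decisions):
    """proposals: {process: proposed value}.
    decisions: {process: decided value} for the processes that decided.
    Returns the pair (validity, agreement)."""
    validity = all(v in proposals.values() for v in decisions.values())
    agreement = len(set(decisions.values())) <= 1
    return validity, agreement

# Hypothetical run: three processes propose, two of them decide 1.
outcome = check_consensus_run({"p1": 0, "p2": 1, "p3": 1},
                              {"p1": 1, "p2": 1})
```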
We now show how to solve randomized consensus in m&m systems with the maximum fault-tolerance
possible. To do so, we combine the randomized consensus algorithm by Aspnes and Herlihy [4] (henceforth
the AH algorithm), which was designed for shared-memory systems with atomic SWMR registers, with
the linearizable implementation of such registers for m&m systems that we gave in Section 3.1. Doing so,
however, is not as straightforward as it may seem: as pointed out in [13], a randomized algorithm that
works with atomic registers does not necessarily work against a strong adversary (i.e., it may lose some
of its properties, including termination) if we replace the atomic registers that it uses with linearizable
implementations of these registers.
Figure 9: A graph $G$
In Section 5.1, we explain why we can indeed solve randomized consensus in m&m systems by
replacing the atomic registers used by the AH algorithm with the linearizable implementation of atomic
registers given in Section 3.1. In Section 5.2, we note that doing so is optimal in the number of process
crashes that it can tolerate in (both general and uniform) m&m systems. These results are summarized by:
Theorem 26.
• Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of
$\Pi = \{p_1, p_2, \ldots, p_n\}$. Randomized consensus can be solved in $S_L$ if and only if at most $t_L$
processes crash in $S_L$.
• Let $S_G$ be the uniform m&m system induced by an undirected graph
$G = (V, E)$ where $V = \{p_1, p_2, \ldots, p_n\}$. Randomized consensus can be solved if and only if at
most $t_G$ processes crash in $S_G$.
The above theorem follows directly from Theorem 28 (Section 5.1), Theorem 29 (Section 5.2), and
Lemma 21.
It is worth noting that the (optimal) fault-tolerance achieved by our randomized consensus algorithm
for uniform m&m systems is better than the fault-tolerance of the algorithm given for such systems
in [1].⁹ For example, consider the undirected graph $G$ in Figure 9 and the corresponding m&m system
$S_G$. It turns out that the randomized consensus algorithm given in [1] tolerates at most 3 process crashes
in system $S_G$, but our algorithm tolerates up to 4 process crashes in this system (because $t_G = 4$ for
this $G$).
As another example, consider the Hoffman-Singleton graph $G$ (Section 4, Figure 7) and the
corresponding m&m system $S_G$ with 50 processes. As we explained in Section 4, our randomized consensus
algorithm is wait-free, i.e., it tolerates up to $t_G = 49$ process crashes in $S_G$. In contrast, it can be
shown that the algorithm given in [1] cannot tolerate more than 45 process crashes in $S_G$.
As a final example, consider any graph $G$ with $n$ nodes and vertex expansion ratio $h$. The randomized
consensus algorithm in [1] can tolerate at least $\lceil (1 - \frac{1}{2h+2}) n \rceil - 1$ process crashes
in the m&m system $S_G$ (Theorem 4.3 in [1]). In contrast, by Theorems 24 and 26 we have:
Corollary 27. For any undirected graph $G$ with $n$ nodes and vertex expansion ratio $h$, there is a
randomized consensus algorithm that tolerates at least $\lceil (1 - \frac{1}{h^2+2h+2}) n \rceil - 1$ process
crashes in the uniform m&m system $S_G$.
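To get a feel for the gap between the two guarantees, the bounds can be evaluated numerically. The sketch below only evaluates the two closed-form expressions quoted above; the function names are ours, and exact rational arithmetic is used so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def bound_this_paper(n, h):
    """ceil((1 - 1/(h^2 + 2h + 2)) * n) - 1, as in Theorem 24."""
    h = Fraction(h)
    return ceil((1 - 1 / (h * h + 2 * h + 2)) * n) - 1

def bound_from_ref1(n, h):
    """ceil((1 - 1/(2h + 2)) * n) - 1, the bound quoted from [1]."""
    h = Fraction(h)
    return ceil((1 - 1 / (2 * h + 2)) * n) - 1
```

For $n = 100$ and $h = 1$ the two expressions evaluate to 79 and 74, respectively; for $h = 0$ both reduce to 49, the pure message-passing bound. Since $h^2 + 2h + 2 \ge 2h + 2$ for every $h \ge 0$, the first bound is never smaller than the second.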
5.1 Solving randomized consensus
The randomized consensus algorithm by Aspnes and Herlihy [4] was originally proved to work against
a strong adversary in a shared-memory system under the assumption that the SWMR registers that
it uses are atomic (i.e., instantaneous). As we mentioned before, replacing the atomic registers of
a randomized algorithm with linearizable implementations of atomic registers may not preserve the
properties of this algorithm: to preserve them may require strongly linearizable implementations, rather
than just linearizable implementations [13]. Our atomic register implementation for m&m systems,
however, is not strongly linearizable (see Appendix A). So prima facie, combining the AH algorithm
with our implementations of atomic registers may not work against a strong adversary in m&m systems.
⁹ [1] considers randomized consensus algorithms only for uniform m&m systems.
It was recently shown [14], however, that if we replace the atomic SWMR registers used by the AH
algorithm with any linearizable implementation of atomic registers (such as the one that we give for
m&m systems in Section 3.1), we obtain a randomized consensus algorithm that does work against a
strong adversary. So, from Theorem 17 in Section 3.1 and Theorem 20 in [14], we have:
Theorem 28. Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of
$\Pi = \{p_1, p_2, \ldots, p_n\}$. By replacing the atomic SWMR registers used by the randomized consensus
algorithm given in [4] with the linearizable implementation of such registers for system $S_L$ given in
Section 3.1, we obtain an algorithm that solves randomized consensus in $S_L$ and tolerates up to $t_L$
process crashes.
It is worth noting that [14] proved that the AH algorithm does not need register atomicity or
linearizability to work: in fact this algorithm works against a strong adversary even with regular SWMR
registers [24]. In contrast to atomic SWMR registers, each operation of a regular SWMR register spans
an interval that starts with an invocation and terminates with a response. Moreover, in contrast to
linearizable implementations of atomic SWMR registers, a regular SWMR register satisfies only Property 1
but not Property 2 (Section 2), and so it allows “new-old” inversions [24].
5.2 Lower bound
A fault-tolerance lower bound on solving consensus in uniform m&m systems was given in [1] (Theorem
4.4). A simple generalization of this result shows that the randomized consensus algorithm of Theorem 28
is optimal in the number of process crashes that it can tolerate in general m&m systems. More precisely:
Theorem 29. Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of
$\Pi = \{p_1, p_2, \ldots, p_n\}$. Randomized consensus can be solved in $S_L$ only if at most $t_L$ processes
crash.
Proof Sketch. As in [1], the proof for general m&m systems is by a standard partition argument.
Suppose, for contradiction, that there is a randomized consensus algorithm $A$ that tolerates $t > t_L$
process crashes in $S_L$ (*). Since $t > t_L$, by Definition 6 of $t_L$ there are two disjoint subsets $P$
and $P'$ of $\Pi$, of size $n - t$ each, such that: no set $S_i$ in $L$ contains both a process in $P$ and a
process in $P'$ (**). Since $P$ and $P'$ are disjoint, the sets $P$, $P'$, and $Q = \Pi - (P \cup P')$ form
a partition of $\Pi$ (see Figure 3).
Consider the following execution of $A$. Processes in $Q$ take no steps (they crash at the start of this
execution). Processes in $P$ and $P'$ propose 0 and 1, respectively. Processes in $P$ “think” that all the
processes in $P' \cup Q$ are crashed from the start, while processes in $P'$ “think” that all the processes
in $P \cup Q$ are crashed from the start, because:
• Each of $P$ and $P'$ contains $n - t$ processes and up to $t$ processes can crash in $S_L$.
• All the messages between processes in $P$ and $P'$ are delayed, and
• by (**):
– no value written by any process in $P'$ on a shared register can be read by any process in $P$.
– no value written by any process in $P$ on a shared register can be read by any process in $P'$.
Since the consensus algorithm $A$ tolerates $t$ crashes and terminates with probability 1, every process
in $P$ and $P'$ eventually decides 0 and 1, respectively (all the delayed messages between them are received
only after they decide); this violates the Agreement property of consensus.
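The sets $P$ and $P'$ used in this partition argument can be found by exhaustive search on small systems. The sketch below is a brute-force illustration under stated assumptions: processes are numbered 0..n-1, the bag $L$ is given as a list of Python sets, the function names are ours, and the search is exponential in n.

```python
from itertools import combinations

def find_partition(n, bag, t):
    """Return disjoint sets P, P2 of size n - t such that no set in `bag`
    contains both a process of P and a process of P2, or None."""
    m = n - t
    for P in combinations(range(n), m):
        # Processes that share some set S_i with a member of P.
        touched = set(P)
        for S in bag:
            if set(P) & set(S):
                touched |= set(S)
        free = [q for q in range(n) if q not in touched]
        if len(free) >= m:
            return set(P), set(free[:m])
    return None

def t_L(n, bag):
    """Largest t for which no such pair (P, P2) exists."""
    return max(t for t in range(n) if find_partition(n, bag, t) is None)
```

With four processes and the bag [{0, 1}, {2, 3}], `t_L` is 1, and for t = 2 the search returns the partition ({0, 1}, {2, 3}): the two halves share no registers, so delaying all messages between them isolates each half as in the proof sketch.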
6 Concluding remarks
Hybrid systems that combine message passing and shared memory have long been a subject of study
in the systems community [3, 6, 7, 8, 22, 23, 27, 30]. To the best of our knowledge, however, such
systems have only recently been examined from a theoretical point of view. Aguilera et al. gave a
rigorous model for hybrid systems, and studied how the combination of message passing and shared
memory can be harnessed to improve solutions to certain fundamental problems: In particular, they show
that, compared to a pure message-passing system, a hybrid system can improve the fault tolerance of
randomized consensus algorithms and reduce the synchrony necessary to elect a leader [1]. A more recent
paper by Aguilera et al. extends the hybrid model to Byzantine failures, and shows how to improve the
inherent trade-off between fault tolerance and performance for consensus, for both Byzantine and crash
failures [2]. The present paper is another contribution to the theoretical study of hybrid systems: whereas
the well-known ABD algorithm implements an atomic SWMR register with optimal fault tolerance in a
pure message-passing system [5], here we implement such registers with optimal fault tolerance in hybrid
systems. We also show how to solve randomized consensus with optimal fault tolerance in such systems.
Extending our results to hybrid systems with Byzantine failures is a subject for future research.
References
[1] M. K. Aguilera, N. Ben-David, I. Calciu, R. Guerraoui, E. Petrank, and S. Toueg. Passing messages
while sharing memory. In Proceedings of the 2018 ACM Symposium on Principles of Distributed
Computing, PODC 2018, pages 51–60, July 2018.
[2] M. K. Aguilera, N. Ben-David, R. Guerraoui, V. Marathe, and I. Zablotchi. The impact of RDMA
on agreement. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing,
PODC 2019, pages 409–418, July 2019.
[3] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel.
TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28,
Feb. 1996.
[4] J. Aspnes and M. Herlihy. Fast randomized consensus using shared memory. Journal of Algorithms,
11(3):441–461, Sept. 1990.
[5] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. Journal
of the ACM, 42(1):124–142, Jan. 1995.
[6] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach,
and A. Singhania. The multikernel: A new OS architecture for scalable multicore systems. In ACM
Symposium on Operating Systems Principles, pages 29–44, Oct. 2009.
[7] J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Munin: Distributed shared memory based on type-
specific memory coherence. In ACM Symposium on Principles and Practice of Parallel Programming,
pages 168–176, Mar. 1990.
[8] T. David, R. Guerraoui, and M. Yabandeh. Consensus inside. In International Middleware Confer-
ence, pages 145–156, Dec. 2014.
[9] A. Dragojević, D. Narayanan, M. Castro, and O. Hodson. FaRM: Fast remote memory. In
Symposium on Networked Systems Design and Implementation, pages 401–414, Apr. 2014.
[10] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one
faulty process. J. ACM, 32(2):374–382, Apr. 1985.
[11] Gen-Z draft core specification—December 2016.
http://genzconsortium.org/draft-core-specification-december-2016.
[12] Gen-Z DRAM and persistent memory theory of operation.
https://genzconsortium.org/wp-content/uploads/2019/03/Gen-Z-DRAM-PM-Theory-of-Operation-WP.pdf.
[13] W. Golab, L. Higham, and P. Woelfel. Linearizable implementations do not suffice for randomized
distributed computation. In Proceedings of the 2011 ACM Symposium on Theory of Computing,
STOC 2011, pages 373–382, June 2011.
[14] V. Hadzilacos, X. Hu, and S. Toueg. Randomized consensus with regular registers. arXiv:2006.06771
[cs.DC], https://arxiv.org/abs/2006.06771, June 2020.
[15] M. Herlihy and J. Wing. Linearizability: A correctness condition for concurrent objects. ACM
Trans. Program. Lang. Syst., 12(3):463–492, July 1990.
[16] A. J. Hoffman and R. R. Singleton. On Moore graphs with diameters 2 and 3. IBM Journal of
Research and Development, 4(5):497–504, Nov. 1960.
[17] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin of the
American Mathematical Society, 43(4):439–561, Aug. 2006.
[18] InfiniBand. http://www.infinibandta.org/content/pages.php?pg=about_us_infiniband.
[19] iWARP. https://en.wikipedia.org/wiki/IWARP.
[20] A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. In ACM
SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer
Communications, pages 295–306, Aug. 2014.
[21] A. Kalia, M. Kaminsky, and D. G. Andersen. FaSST: Fast, scalable and simple distributed trans-
actions with two-sided (RDMA) datagram RPCs. In Symposium on Operating Systems Design and
Implementation, pages 185–201, Nov. 2016.
[22] S. Kaxiras, D. Klaftenegger, M. Norgren, A. Ros, and K. Sagonas. Turning centralized coherence
and distributed critical-section execution on their head: A new approach for scalable distributed
shared memory. In Proceedings of the 24th International Symposium on High-Performance Parallel
and Distributed Computing, HPDC 2015, pages 3–14, June 2015.
[23] D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, and B.-H. Lim. Integrating message-passing
and shared-memory: Early experience. In ACM Symposium on Principles and Practice of Parallel
Programming, pages 54–63, May 1993.
[24] L. Lamport. On interprocess communication Parts I–II. Distributed Computing, 1(2):77–101, May
1986.
[25] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated
memory for expansion and sharing in blade servers. In International Symposium on Computer
Architecture, pages 267–278, June 2009.
[26] M. C. Loui and H. H. Abu-Amara. Memory requirements for agreement among unreliable
asynchronous processes. Advances in Computing Research, 4:163–183, 1987.
[27] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-tolerant software
distributed shared memory. In USENIX Annual Technical Conference, pages 291–305, July 2015.
[28] M. Poke and T. Hoefler. Dare: High-performance state machine replication on RDMA networks.
In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed
Computing, HPDC 2015, pages 107–118, June 2015.
[29] RDMA over converged ethernet. https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet.
[30] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A low overhead, software-only approach
for supporting fine-grain shared memory. In International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 174–185, Oct. 1996.
[31] S.-Y. Tsai and Y. Zhang. LITE kernel RDMA support for datacenter applications. In ACM Sym-
posium on Operating Systems Principles, pages 306–324, Oct. 2017.
[32] Figure by Uzyel - Own work, CC BY-SA 3.0.
https://commons.wikimedia.org/w/index.php?curid=10378641.
[33] J. Yang, J. Izraelevitz, and S. Swanson. Orion: A distributed file system for non-volatile main mem-
ory and RDMA-capable networks. In 17th USENIX Conference on File and Storage Technologies,
FAST 2019, pages 221–234, Feb. 2019.
Appendix A Algorithm 1 is not strongly linearizable
We now prove that our implementation of atomic SWMR registers for m&m systems given in Section
3.1 is not strongly linearizable. To do so, we show that the ABD algorithm [5] that implements atomic
SWMR registers for pure message-passing systems is not strongly linearizable; recall that the ABD
algorithm is a special case of Algorithm 1.
First recall the definition of strong linearizability [13]:
Definition 30. Let ℋ be a prefix-closed set of histories. ℋ is strongly linearizable if there exists a
function f mapping histories in ℋ to sequential histories, such that
• for any H ∈ ℋ, f(H) is a linearization of H, and
• for any G, H ∈ ℋ, if G is a prefix of H, then f(G) is a prefix of f(H).
Definition 31. An implementation of a shared object type is strongly linearizable if the set of histories
of the implementation is strongly linearizable (see footnote 10).
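As an illustration only, the prefix condition of Definition 30 can be checked mechanically on a finite set of histories. The sketch below is not part of the paper: it assumes that histories are encoded as Python tuples of event labels and that f is given as a dictionary. The names is_prefix and prefix_preserving and the event labels are hypothetical. It checks only the second condition of the definition and does not verify that f(H) is a linearization of H.

```python
def is_prefix(a, b):
    """Return True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and tuple(b[:len(a)]) == tuple(a)

def prefix_preserving(f):
    """Check the prefix condition of Definition 30 on a finite map f.

    f maps each history (a tuple of events) to a sequential history
    (a tuple of operation names). Assumption: the caller has already
    ensured that every f[h] is a linearization of h.
    """
    for g in f:
        for h in f:
            if is_prefix(g, h) and not is_prefix(f[g], f[h]):
                return False
    return True

# Hypothetical encoding: a read r overlaps a write w, and r completes last.
G = ("inv_r", "inv_w", "res_w")
H = G + ("res_r",)

print(prefix_preserving({G: ("w",), H: ("w", "r")}))      # order of f(G) kept
print(prefix_preserving({G: ("r", "w"), H: ("w", "r")}))  # order of f(G) changed
```

The second call models a function that commits to placing r before w in f(G) and then has to reverse that order in f(H), which is exactly what the prefix condition forbids.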
Algorithm 2 The ABD implementation of an atomic SWMR register writeable by process w and
readable by all processes in a message-passing system S, provided that at most ⌊(n−1)/2⌋ processes crash.
R[p] : local register writeable and readable only by p ;
initialized to 〈0, u0〉.
Write(〈snw, u〉): ▷ executed by the writer w
1: send 〈W, 〈snw, u〉〉 to every process p in S
2: wait for 〈ACK-W, snw〉 messages from ⌈(n+1)/2⌉ distinct processes
3: return
▷ executed by every process p in S
4: upon receipt of a 〈W, 〈snw, u〉〉 message from process w:
5:   〈sn, −〉 ← R[p]
6:   if snw > sn then
7:     R[p] ← 〈snw, u〉
8:   send 〈ACK-W, snw〉 to process w
Read(): ▷ executed by any process q
9: snr ← snr + 1
10: send 〈R, snr〉 to every process p in S
11: wait for 〈ACK-R, snr, 〈−, −〉〉 messages from ⌈(n+1)/2⌉ distinct processes
12: 〈seq, val〉 ← max{〈r_sn, r_u〉 | received a 〈ACK-R, snr, 〈r_sn, r_u〉〉 message}
13: Write(〈seq, val〉)
14: return 〈seq, val〉
▷ executed by every process p in S
15: upon receipt of a 〈R, snr〉 message from a process q:
16:   〈r_sn, r_u〉 ← R[p]
17:   send 〈ACK-R, snr, 〈r_sn, r_u〉〉 to process q
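The two thresholds in Algorithm 2 fit together in a standard way: with at most ⌊(n−1)/2⌋ crashes, at least ⌈(n+1)/2⌉ processes remain to reply, and any two sets of ⌈(n+1)/2⌉ processes share at least one process. The following Python sketch, which is not taken from the paper, checks both facts by brute force for small n. The function names are hypothetical.

```python
from itertools import combinations

def max_crashes(n):
    """Largest number of crashes tolerated: floor((n - 1) / 2)."""
    return (n - 1) // 2

def quorum_size(n):
    """Number of replies awaited in lines 2 and 11: ceil((n + 1) / 2)."""
    return n // 2 + 1  # equals ceil((n + 1) / 2) for every integer n >= 1

def quorums_always_intersect(n):
    """Brute-force check that any two sets of quorum_size(n) processes overlap."""
    k = quorum_size(n)
    return all(set(a) & set(b)
               for a in combinations(range(n), k)
               for b in combinations(range(n), k))

for n in range(1, 8):
    # Enough processes stay alive to answer, so the waits can terminate.
    assert n - max_crashes(n) >= quorum_size(n)
    # A reader's quorum always meets the quorum of a completed write.
    assert quorums_always_intersect(n)
```

For the 3-process system used in the proof below, these functions give a quorum of 2 replies and a tolerance of 1 crash.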
Theorem 32. The ABD implementation of an atomic SWMR register in pure message-passing systems
(shown in Algorithm 2) is not strongly linearizable.
Proof. Consider a pure message-passing system S with 3 processes, namely, w, p, q. Let R be the atomic
SWMR register implemented by Algorithm 2 in S. R can be written by w and read by all processes of
S. Let ℋ be the set of histories of Algorithm 2 (in these histories we omit all steps other than the
invocations and responses of read and write operations on R). To prove that Algorithm 2 is not strongly
linearizable, we show that ℋ is not strongly linearizable. More precisely, we prove that for any function
f that maps histories in ℋ to sequential histories, there exist histories G, H ∈ ℋ such that G is a prefix
of H but f(G) is not a prefix of f(H).
Let f be a function that maps histories in ℋ to sequential histories. Consider the following history
G ∈ ℋ (shown in Figure 10):
• Initially, R contains v0, and so all the local registers R[−] contain 〈0, v0〉.
Footnote 10: In a history of an object implementation, we omit all steps other than the invocation and response steps on that object.
Figure 10: History G
• At time t1, process p starts an operation r to read R. According to line 10 of Algorithm 2, p first
sends 〈R, snr〉 to all processes, then:
– p receives 〈R, snr〉 from itself, reads 〈0, v0〉 from R[p] (line 16), and sends 〈ACK-R, snr, 〈0, v0〉〉
to itself (line 17). And so p receives 〈ACK-R, snr, 〈0, v0〉〉 from itself.
– let m0 denote the message 〈R, snr〉 from p to w; delay m0. Since w does not receive m0, w
takes no step.
– q receives the message 〈R, snr〉 from p and reads 〈0, v0〉 from R[q] (line 16). Then q sends back
〈ACK-R, snr, 〈0, v0〉〉 to p (line 17), say at time t2. Let m1 denote the message 〈ACK-R, snr, 〈0, v0〉〉
from q to p and delay m1.
• At some time t3 > t2, the writer w starts an operation w to write the value v1 into R with sequence
number 1, for some v1 ≠ v0. Process w first sends the message 〈W, 〈1, v1〉〉 to all processes (line 1)
including itself, but the message to p is delayed. Processes w and q receive 〈W, 〈1, v1〉〉 from w, and
since R[w] and R[q] contain 〈0, v0〉, by line 6 of Algorithm 2, both w and q write 〈1, v1〉 to R[w]
and R[q] respectively (line 7), and they send 〈ACK-W, 1〉 to w (line 8). So w receives 〈ACK-W, 1〉
from w and q. By line 2, the write operation w terminates, say at time t4.
• After time t4, w receives the delayed message m0 from p. Since now R[w] contains 〈1, v1〉, w reads
〈1, v1〉 in line 16. And so w sends 〈ACK-R, snr, 〈1, v1〉〉 to p (line 17), say at time t5. Let m2 denote
the message 〈ACK-R, snr, 〈1, v1〉〉 from w to p; delay m2.
Since the write operation w terminates in G and f(G) is a linearization of G, w is in f(G). Since
the read operation r is concurrent with w, there are two cases: (1) r is before w in f(G), (2) r is not
before w in f(G).
Figure 11: History H of Case 1
Case 1: r is before w in f(G). Consider the following history H ∈ ℋ (Figure 11):
• H is an extension of G, i.e., G is a prefix of H.
• At time t6 > t5, p receives the delayed message m2 from w. Since p receives 〈0, v0〉 from itself and
receives 〈1, v1〉 from w, by lines 12 and 13, p selects 〈1, v1〉 and returns 〈1, v1〉 in line 14, i.e., the
read operation r of p returns v1.
Since the read operation r of p returns v1 in H, and f(H) is a linearization of H, by Property 1 of
a linearizable atomic SWMR register implementation, r is after w in f(H). However, since, by assumption,
r is before w in f(G), f(G) is not a prefix of f(H).
Figure 12: History H of Case 2
Case 2: r is not before w in f(G). Consider the following history H ∈ ℋ (Figure 12):
• G is a prefix of H.
• At time t6 > t5, p receives the delayed message m1 from q. Since p receives 〈0, v0〉 from both itself
and q, by lines 12 and 13, p selects 〈0, v0〉 and returns 〈0, v0〉 in line 14, i.e., the read operation r of
p returns v0.
Since the read operation r of p returns v0 in H, and f(H) is a linearization of H, by Property 1
of a linearizable atomic SWMR register implementation, r is before w in f(H). However, since, by
assumption, r is not before w in f(G), f(G) is not a prefix of f(H).
So in both cases, there is a history H ∈ ℋ such that G is a prefix of H but f(G) is not a prefix of
f(H). Therefore the theorem holds.
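The two extensions of G used in the proof can be replayed with a small deterministic script. This is a minimal sketch written for illustration, not the authors' code: it hard-codes the 3-process schedule described above instead of simulating a network, it omits the write-back of line 13 (which does not affect the value returned), and the function name replay and its argument are hypothetical.

```python
def replay(delivered):
    """Replay history G for processes w, p, q, then deliver one delayed reply to p.

    delivered is "m1" (the reply from q, Case 2) or "m2" (the reply from w, Case 1).
    Returns the value returned by the read operation r of p.
    """
    R = {x: (0, "v0") for x in ("w", "p", "q")}  # local registers R[-]
    replies = [R["p"]]        # t1: p answers its own <R, sn_r> (lines 16-17)
    m1 = R["q"]               # t2: q replies <0, v0>; this reply m1 is delayed
    new = (1, "v1")           # t3: w starts writing v1 with sequence number 1
    for x in ("w", "q"):      # the copy of <W, <1, v1>> sent to p is delayed
        if new[0] > R[x][0]:  # line 6
            R[x] = new        # line 7
    # t4: w holds ACK-W from w and q (2 of 3 processes), so the write terminates.
    m2 = R["w"]               # t5: w receives m0, replies <1, v1>; m2 is delayed
    # End of G. Extend it by delivering exactly one of the delayed replies.
    replies.append({"m1": m1, "m2": m2}[delivered])
    assert len(replies) == 2  # quorum of ceil((3 + 1) / 2) = 2 replies (line 11)
    seq, val = max(replies)   # line 12: pair with the largest sequence number
    return val                # line 14

print(replay("m2"))  # Case 1 extension: prints v1
print(replay("m1"))  # Case 2 extension: prints v0
```

Both runs share the same prefix G, yet the read returns v1 in one and v0 in the other, so whichever order f(G) fixes between r and w is contradicted by one of the two extensions.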
