Optimal Register Construction in M&M Systems by Hadzilacos, Vassos et al.
Optimal Register Construction in M&M Systems
Vassos Hadzilacos
Department of Computer Science, University of Toronto, Canada
vassos@cs.toronto.edu
Xing Hu
Department of Computer Science, University of Toronto, Canada
xing@cs.toronto.edu
Sam Toueg
Department of Computer Science, University of Toronto, Canada
sam@cs.toronto.edu
Abstract
Motivated by recent distributed systems technology, Aguilera et al. introduced a hybrid model of
distributed computing, called message-and-memory model or m&m model for short [1]. In this model,
processes can communicate by message passing and also by accessing some shared memory. We
consider the basic problem of implementing an atomic single-writer multi-reader (SWMR) register
shared by all the processes in m&m systems. Specifically, we give an algorithm that implements
such a register in m&m systems and show that it is optimal in the number of process crashes that
it can tolerate. This generalizes the well-known implementation of an atomic SWMR register in a
pure message-passing system [4].
2012 ACM Subject Classification Theory of computation → Concurrency; Theory of computation
→ Parallel computing models; Theory of computation → Distributed computing models
Keywords and phrases asynchronous distributed system, shared memory, message passing
Digital Object Identifier 10.4230/LIPIcs.OPODIS.2019.28
Funding This research was partially funded by the Natural Sciences and Engineering Research
Council of Canada.
1 Introduction
Motivated by recent distributed systems technology [9, 12, 13, 19, 22], Aguilera et al.
introduced a hybrid model of distributed computing, called message-and-memory model or
m&m model for short [1]. In this model processes can communicate by message passing and
also by accessing some shared memory. Since it is impractical to share memory among all
processes in large distributed systems [8, 14, 15, 24], the m&m model allows us to specify
which subsets of processes share which sets of registers. Among other results, Aguilera et al.
show that it is possible to leverage the advantages of the two communication mechanisms
(message-passing and shared-memory) to improve the fault-tolerance of randomized consensus
algorithms compared to a pure message-passing system [1].
In this paper, we consider the more basic problem of implementing an atomic single-
writer multi-reader (SWMR) register shared by all the processes in m&m systems, and
we give an algorithm that is optimal in the number of process crashes that it can tolerate.
This generalizes the well-known implementation of an atomic SWMR register in a pure
message-passing system [4]. We now describe our results in more detail.
A general m&m system SL is specified by a set of n asynchronous processes that
can send messages to each other over asynchronous reliable links, and by a collection
L = {S1, S2, . . . , Sm} where each Si is a subset of processes: for each Si, there is a set of
atomic registers that can be shared by processes in Si and only by them. Even though
© Vassos Hadzilacos, Xing Hu, and Sam Toueg;
licensed under Creative Commons License CC-BY
23rd International Conference on Principles of Distributed Systems (OPODIS 2019).
Editors: Pascal Felber, Roy Friedman, Seth Gilbert, and Avery Miller; Article No. 28; pp. 28:1–28:16
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
the m&m model allows the collection L to be arbitrary, in practice hardware technology
imposes a structure on L [8, 14]: for processes to share memory, they must establish a
connection between them (e.g., an RDMA connection). These connections are naturally
modelled by an undirected shared-memory graph G whose nodes are the processes and whose
edges are shared-memory connections [1]. Such a graph G defines what Aguilera et al. call a
uniform m&m system SG, where each process has atomic registers that it can share with its
neighbours in G (and only with them). Note that SG is the instance of the general m&m
system SL with L = {S1, S2, . . . , Sn} where each Si consists of a process and its neighbours
in G. Furthermore, if G is the trivial graph with n nodes but no edges, the m&m system SG
that G induces is just a pure message-passing asynchronous system with n processes.
We consider the implementation of an atomic SWMR register R, shared by all the
processes, in both general and uniform m&m systems. For each general m&m system SL, we
determine the maximum number of crashes tL for which it is possible to implement R in SL:
we give an algorithm that tolerates tL crashes and prove that no algorithm can tolerate more
than tL crashes. Similarly, for each shared-memory graph G and its corresponding uniform
m&m system SG, we use the topology of G to determine the maximum number of crashes tG
for which it is possible to implement R in SG. By specifying tG in terms of the topology of
G, one can leverage results from graph theory to design m&m systems that can implement
R with high fault tolerance and relatively few RDMA connections per process. For example,
it allows us to design an m&m system with 50 processes that can implement a wait-free R
(i.e., this implementation can tolerate any number of process crashes) with only 7 RDMA
connections per process; as explained in Section 4, this is optimal in some precise sense.
An important remark is now in order. In this paper we consider RDMA systems where
process crashes do not affect the accessibility of the shared registers of that system. This
is the case in systems where the CPU, the DRAM (main memory), and the NIC (Network
Interface Controller) are separate entities: for example, in the InfiniBand cluster evaluated
in [21], the crash of a CPU, and of the processes that it hosts, does not prevent other
processes from accessing its DRAM because they can use the NIC without involving the CPU;
see also [8, 10, 26].
2 Model outline
We consider m&m systems with a set of n asynchronous processes Π = {p1, p2, . . . pn} that
may crash. To define these systems we first recall the definition of atomic SWMR registers.
2.1 Atomic SWMR registers
A SWMR register R shared by a set S of processes is a register that can be written
(sequentially) by exactly one process w ∈ S and can be read by all processes in S; we say
that w is the writer of R [18].
We now define an atomic SWMR register [4, 18] in terms of two simple properties that
it must satisfy. To do so, we first define what it means for a (read or write) operation
to precede another operation, and for two operations to be concurrent. We say that an
operation o precedes another operation o′ if and only if o completes before o′ starts. A write
operation o immediately precedes a read operation r if and only if o precedes r, and there
is no write operation o′ such that o precedes o′ and o′ precedes r. Operations o and o′ are
concurrent if and only if neither precedes the other.
We assume, without loss of generality, that the values successively written by the single
writer w of a SWMR register R are distinct, and different from the initial value of R.¹ Let
v0 denote the initial value of R, and vk denote the value written by the k-th write operation
of w. A SWMR register R is atomic if and only if it satisfies the following two properties:
▶ Property 1. If a read operation r returns the value v then:
- there is a write v operation that immediately precedes r or is concurrent with r, or
- no write operation precedes r and v = v0.
▶ Property 2. If two read operations r and r′ return values vk and vk′, respectively, and r
precedes r′, then k ≤ k′.
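The two properties can be checked mechanically on a finite history. The following is a minimal sketch, not part of the paper: it assumes that each operation is recorded as a pair of start and end times, that the list `writes` holds the sequential writes of w in order (so entry k − 1 is the write of vk), and that each read records the index k of the value vk it returned. The names `satisfies_property_1` and `satisfies_property_2` are hypothetical.

```python
from typing import List, Tuple

Write = Tuple[float, float]      # (start, end) of the k-th write, k = 1, 2, ...
Read = Tuple[float, float, int]  # (start, end, k): the read returned v_k


def precedes(a_end: float, b_start: float) -> bool:
    """An operation precedes another iff it completes before the other starts."""
    return a_end < b_start


def satisfies_property_1(writes: List[Write], reads: List[Read]) -> bool:
    for r_start, r_end, k in reads:
        # Indices of the writes that precede this read.
        preceding = [j for j, (_, w_end) in enumerate(writes, start=1)
                     if precedes(w_end, r_start)]
        if k == 0:
            # v_0 may be returned only if no write precedes the read.
            if preceding:
                return False
            continue
        if k > len(writes):
            return False  # the read returned a value that was never written
        w_start, w_end = writes[k - 1]
        concurrent = (not precedes(w_end, r_start)
                      and not precedes(r_end, w_start))
        # Writes are sequential, so write k immediately precedes the read
        # iff it is the last write that precedes the read.
        immediately_precedes = bool(preceding) and max(preceding) == k
        if not (concurrent or immediately_precedes):
            return False
    return True


def satisfies_property_2(reads: List[Read]) -> bool:
    return all(k1 <= k2
               for (_, e1, k1) in reads
               for (s2, _, k2) in reads
               if precedes(e1, s2))
```

For instance, with writes at [0, 1] and [2, 3], a read over [4, 5] that returns v1 violates Property 1, and two consecutive reads that return v2 and then v1 violate Property 2 even if each of them satisfies Property 1 on its own.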
2.2 General m&m systems
Let L = {S1, S2, . . . , Sm} be any bag of non-empty subsets of Π = {p1, p2, . . . pn}.
▶ Definition 3. ML is the class of m&m systems (induced by L), each consisting of:
1. The processes in Π.
2. Reliable asynchronous communication links between every pair of processes in Π.
3. The following set of registers: For each subset of processes Si in L, a non-empty set of
atomic registers that are shared by the processes in Si (and only by them).
Note that ML includes m&m systems that differ by the type and number of registers
shared by the processes in each Si; for example they could be sharing multi-writer multi-reader
atomic registers.
Since we are interested in implementing atomic SWMR registers (shared by all processes
in the system), here we focus on an m&m system of ML in which the registers shared
by the processes in each set Si are atomic SWMR registers. More precisely, we focus on the
m&m system SL defined below:
▶ Definition 4. The general m&m system SL (induced by L) consists of:
1. The processes in Π.
2. Reliable asynchronous communication links between every pair of processes in Π.
3. The following set of registers: For each subset of processes Si in L and each process
p ∈ Si, an atomic SWMR register, denoted Ri[p], that can be written by p and read by all
processes in Si (and only by them).
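As an illustration of item 3 of Definition 4, the register layout of SL can be written down as a table indexed by pairs (i, p). This is only a sketch under assumptions that are not in the paper: processes are represented by the integers 1, . . . , n, the bag L is a Python list of sets (a list, so that repeated sets are allowed), and the helper names `build_registers`, `writable_by` and `readable_by` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

RegisterId = Tuple[int, int]  # (i, p) stands for the register R_i[p]


def build_registers(bag: List[Set[int]], initial=None) -> Dict[RegisterId, object]:
    """One register R_i[p] for each set S_i of the bag and each p in S_i."""
    return {(i, p): initial
            for i, s in enumerate(bag, start=1)
            for p in s}


def writable_by(bag: List[Set[int]], p: int) -> List[RegisterId]:
    """Registers that p can write: R_i[p] for every S_i that contains p."""
    return [(i, p) for i, s in enumerate(bag, start=1) if p in s]


def readable_by(bag: List[Set[int]], q: int) -> List[RegisterId]:
    """Registers that q can read: every R_i[p] such that q is in S_i."""
    return [(i, p)
            for i, s in enumerate(bag, start=1) if q in s
            for p in sorted(s)]
```

With the hypothetical bag `[{1, 2}, {4, 5}, {2, 3, 4}]`, process 2 can write R1[2] and R3[2], while process 1 can read only R1[1] and R1[2].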
In this paper, for every L, we give an algorithm that implements atomic SWMR registers
shared by all processes in the m&m system SL, and we show that this algorithm is optimal
in the number of process crashes that can be tolerated. In fact we prove that any algorithm
that implements such registers in any m&m system in ML (not only in SL) cannot tolerate
more crashes. This justifies our focus on SL: considering members of ML with more or
stronger registers than SL does not improve the fault tolerance of implementing atomic
SWMR registers shared by all.
Without loss of generality we assume the following:
▶ Assumption 5. The bag L = {S1, S2, . . . , Sm} of subsets of Π = {p1, p2, . . . pn} is such
that every process in Π is in at least one of the subsets Sj of L.
1 This can be ensured by the writer w writing values of the form 〈sn, v〉 where sn is the value of a counter
that w increments before each write.
Figure 2 The uniform m&m system SG induced by G.
This assumption can be made without loss of generality because it does not strengthen the
system SL induced by L. In fact, given a bag L that does not satisfy the above assumption,
we can construct a bag that satisfies the assumption as follows: for every process pi in Π that
is not contained in any Sj of L, we can add the singleton set {pi} to L. Let L′ be the resulting
bag. By Definition 4(3) above, adding {pi} to L results in adding only a local register to the
induced system SL, namely, an atomic register that pi (trivially) shares only with itself. So
SL′ is just SL with some additional local registers. Note that a pure message-passing system
(with no shared memory) with n processes p1, p2, . . . pn is modelled by the system SL where
L = {{p1}, {p2}, . . . {pn}}.
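The construction of L′ described above is easy to state operationally. The sketch below assumes, for illustration only, that processes are integers and that a bag is a Python list of sets; `normalize` and `pure_message_passing` are hypothetical names.

```python
from typing import Iterable, List, Set


def normalize(bag: List[Set[int]], processes: Iterable[int]) -> List[Set[int]]:
    """Add the singleton {p} for every process p that no set of the bag contains."""
    covered = set().union(*bag)
    return list(bag) + [{p} for p in sorted(processes) if p not in covered]


def pure_message_passing(processes: Iterable[int]) -> List[Set[int]]:
    """The bag {{p1}, ..., {pn}}: every register is local to a single process."""
    return normalize([], processes)
```

For example, `normalize([{1, 2}], {1, 2, 3, 4})` yields `[{1, 2}, {3}, {4}]`, which satisfies Assumption 5.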
2.3 Uniform m&m systems
Let G = (V,E) be an undirected graph such that V = Π, i.e., the nodes of G are the n
processes p1, p2, . . . , pn of the system. For each pi ∈ V , let N(pi) = {pj | (pi, pj) ∈ E} be
the neighbours of pi in G, and let N+(pi) = N(pi) ∪ {pi}.
▶ Definition 6. The uniform m&m system SG (induced by G) is the m&m system SL
where L = {S1, S2, . . . , Sn} with Si = N+(pi).²
The graph G induces the uniform m&m system SG where processes can communicate
by message passing (via reliable asynchronous communication links), and also by shared
memory as follows: for each process pi, and every neighbour p of pi in G (including pi) there
is an atomic SWMR register Ri[p] that can be written by p and read by every neighbour of
pi in G (including pi). We can think of the registers Ri[∗] as being physically located in the
DRAM of the host of pi, and every neighbour of pi accessing these registers over its RDMA
connection to pi (which is modelled by an edge of G).³
2 Note that L satisfies Assumption 5 because each Si = N+(pi) contains pi.
3 As we mentioned in the introduction, we assume that the crash of pi does not prevent the neighbours of pi
from accessing the shared registers Ri[∗].
For example, in Figures 1 and 2 we show a graph G and the uniform m&m system
SG induced by G. Here G has five nodes representing processes p1, p2, p3, p4, p5; the edges
of G represent the RDMA connections that allow these processes to share registers. The
uniform m&m system SG induced by G is the system SL for L = {S1, S2, S3, S4, S5} where
each Si consists of pi and its neighbours in G: specifically, S1 = {p1, p2}, S2 = {p1, p2, p3},
S3 = {p2, p3, p4, p5}, and S4 = S5 = {p3, p4, p5}. The box adjacent to each process pi in SG
represents the atomic SWMR registers that are shared among pi and its neighbours in G
(intuitively these registers are located at pi). For example, in the box adjacent to process p2,
the component labelled p1 represents the register R2[p1] that can be written by p1 and read
by all the neighbours of p2 in G, namely p1, p2, and p3. Similarly, registers R2[p2] and R2[p3]
can be written by p2 and p3, respectively, and read by p1, p2, and p3. The dashed lines in
Figure 2 represent the asynchronous message-passing links between the processes of SG.
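The passage from a shared-memory graph G to the bag L of Definition 6 can be sketched as follows. The edge list used here is not taken from the figure, which is not reproduced in this text; it is inferred from the sets S1, . . . , S5 listed above and should be read as an assumption. Processes are represented by the integers 1 to 5, and `closed_neighbourhoods` is a hypothetical name.

```python
from typing import List, Set, Tuple


def closed_neighbourhoods(n: int, edges: List[Tuple[int, int]]) -> List[Set[int]]:
    """Return [N+(p1), ..., N+(pn)], where N+(pi) is pi together with its neighbours."""
    nbrs = {i: {i} for i in range(1, n + 1)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return [nbrs[i] for i in range(1, n + 1)]


# Assumed edges of the example graph G, inferred from S1, ..., S5 in the text.
EXAMPLE_EDGES = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
EXAMPLE_BAG = closed_neighbourhoods(5, EXAMPLE_EDGES)
```

Under this assumption `EXAMPLE_BAG` coincides with the sets S1 = {p1, p2}, S2 = {p1, p2, p3}, S3 = {p2, p3, p4, p5} and S4 = S5 = {p3, p4, p5} given in the example.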
3 Atomic SWMR register implementation in general m&m systems
Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of
Π = {p1, p2, . . . pn}. Recall that in system SL, for every Si in L, the processes in Si share
some atomic SWMR registers that can be read only by the processes in Si (recall that it is
impractical to share registers among all processes in large distributed systems [8, 14, 15, 24]).
In the rest of the paper, we determine the maximum number of process crashes tL that may
occur in SL such that it is possible to implement in SL a shared atomic SWMR register
readable by all processes in SL. Intuitively, if t ≤ tL processes may crash, then any two
subsets of processes of size n − t either intersect, or they each contain a process that can
communicate with the other via a shared SWMR register that it can write and the other can
read. If t > tL processes may crash, then there are two subsets of processes of size n − t that
are disjoint and cannot communicate via any shared SWMR register.
▶ Definition 7. Given a bag L = {S1, . . . , Sm} of subsets of Π = {p1, p2, . . . pn}, tL is the
maximum integer t such that the following condition holds: For all disjoint subsets P and P′
of Π of size n − t each, some set Si in L contains both a process in P and a process in P′.
Note that if t ≤ ⌊(n − 1)/2⌋ then there are no disjoint subsets P and P′ of Π of size n − t each,
and so the above condition is vacuously true. Therefore tL ≥ ⌊(n − 1)/2⌋. Recall that for a
pure message-passing system, L = {{p1}, {p2}, . . . {pn}}, so in this system tL = ⌊(n − 1)/2⌋.
To illustrate Definition 7, suppose Π = {p1, p2, p3, p4, p5} and L = {S1, S2, S3} where
S1 = {p1, p2}, S2 = {p4, p5}, and S3 = {p2, p4, p3}. By the definition of tL: (1) tL ≥ 3
because for any two disjoint subsets of Π of size 5 − 3 = 2 each, there exists a set Si in L
that intersects both subsets; e.g., for subsets {p1, p5} and {p3, p4}, the set S2 = {p4, p5}
intersects both of them. (2) tL < 4 because there are two disjoint subsets {p1}, {p5} of size
5 − 4 = 1 each, such that no set Si in L contains both p1 and p5. So in this example n = 5
and tL = 3 > ⌊(5 − 1)/2⌋ = 2.
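For small systems, tL can be computed directly from Definition 7 by exhaustive search. The following sketch is not from the paper; it assumes processes are the integers 1, . . . , n and the bag is a list of Python sets, it takes time exponential in n, and the names `condition_holds` and `t_L` are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def condition_holds(n: int, bag: List[Set[int]], t: int) -> bool:
    """Condition of Definition 7 for a given t, checked by brute force."""
    procs = range(1, n + 1)
    size = n - t
    for P in combinations(procs, size):
        rest = [p for p in procs if p not in P]
        for P2 in combinations(rest, size):
            # Some S_i must contain a process of P and a process of P2.
            if not any(s & set(P) and s & set(P2) for s in bag):
                return False
    return True


def t_L(n: int, bag: List[Set[int]]) -> int:
    """Maximum t for which the condition of Definition 7 holds."""
    return max(t for t in range(n + 1) if condition_holds(n, bag, t))
```

On the example above, `t_L(5, [{1, 2}, {4, 5}, {2, 3, 4}])` evaluates to 3, and on the pure message-passing bag `[{1}, {2}, {3}, {4}, {5}]` it evaluates to 2 = ⌊(5 − 1)/2⌋.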
We now prove that in the general m&m system SL, it is possible to implement an atomic
SWMR register readable by all processes if and only if at most tL processes may crash in SL.
More precisely:
▶ Theorem 8. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of
subsets of Π = {p1, p2, . . . pn}.
- If at most tL processes crash in SL, then for every process w in SL, it is possible to
  implement an atomic SWMR register writable by w and readable by all processes in SL.
- If more than tL processes crash in SL, then for some process w in SL, it is impossible to
  implement an atomic SWMR register writable by w and readable by all processes in SL.
The above theorem is a direct corollary of Theorem 18 (Section 3.1) and Theorem 19
(Section 3.2).
3.1 Algorithm
We now show how to implement an atomic SWMR register R, that can be written by an
arbitrary fixed process w and read by all processes, in an m&m system SL where at most tL
processes may crash. This implementation is given in terms of the procedures Write() and
Read() shown in Algorithm 1.
Without loss of generality we assume that for all i ≥ 1, the i-th value that the writer
writes is of the form 〈i, val〉, and the initial value of the register R is 〈0, u0〉. To write 〈i, val〉
into R, the writer w calls the procedure Write(〈i, val〉). To read R, a process q calls the
procedure Read() which returns a value of the form 〈i, val〉. The sequence number i makes
the values written to R unique.
Algorithm 1 Implementation of an atomic SWMR register writeable by process w and readable
by all processes in SL, provided that at most tL processes crash.
Shared variables
    For all Si in L and all p in Si:
    Ri[p] : atomic SWMR register writeable by p and readable by every process in Si ∈ L;
            initialized to 〈0, u0〉.
Write(〈snw, u〉):                                  ▷ executed by the writer w
 1: send 〈W, 〈snw, u〉〉 to every process p in SL
 2: wait for 〈ACK-W, snw〉 messages from n − tL distinct processes
 3: return
                                                   ▷ executed by every process p in SL
 4: upon receipt of a 〈W, 〈snw, u〉〉 message from process w:
 5:     for every i in {1, . . . , m} such that p ∈ Si do
 6:         〈sn, −〉 ← Ri[p]
 7:         if snw > sn then
 8:             Ri[p] ← 〈snw, u〉
 9:     send 〈ACK-W, snw〉 to process w
Read():                                            ▷ executed by any process q
10: snr ← snr + 1
11: send 〈R, snr〉 to every process p in SL
12: wait for 〈ACK-R, snr, 〈−, −〉〉 messages from n − tL distinct processes
13: 〈seq, val〉 ← max{〈r_sn, r_u〉 | received a 〈ACK-R, snr, 〈r_sn, r_u〉〉 message}
14: Write(〈seq, val〉)
15: return 〈seq, val〉
                                                   ▷ executed by every process p in SL
16: upon receipt of a 〈R, snr〉 message from a process q:
17:     〈r_sn, r_u〉 ← max{〈sn, u〉 | ∃i ∈ {1, . . . , m}, p ∈ Si and ∃p′ ∈ Si, Ri[p′] = 〈sn, u〉}
18:     send 〈ACK-R, snr, 〈r_sn, r_u〉〉 to process q
Algorithm 1 generalizes the well-known implementation of an atomic SWMR register in
pure message-passing systems [4]. To write a new value into R, the writer w sends messages
to all processes asking them to write the new value into all the shared SWMR registers that
they can write in SL. The writer then waits for acknowledgments from n − tL processes
indicating that they have done so. To read R, a process sends messages to all processes asking
them for the most up-to-date value that they can find in all the shared SWMR registers
that they can read. The reader waits for n− tL responses, selects the most up-to-date value
among them, writes back that value (using the same procedure that the writer uses), and
returns it. From the definition of tL it follows that every write of R “intersects” with every
read of R at some shared SWMR register of SL. Note that since at most tL processes crash,
the waiting mentioned above does not block any process.
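To make the interplay between quorums and shared registers concrete, the following toy model replays Algorithm 1 sequentially. It is a sketch under strong simplifying assumptions and not an implementation of the algorithm: there are no real messages or concurrency, a request is handled only by the n − tL processes whose acknowledgements are awaited (which models a schedule where all other messages are delayed), processes are integers, and the class name `MMSimulation` and the `responders` parameter are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Value = Tuple[int, object]  # <sequence number, payload>


class MMSimulation:
    """Sequential toy model of Algorithm 1 (no real messages or concurrency)."""

    def __init__(self, n: int, bag: List[Set[int]], t_l: int, u0: object = None):
        self.n, self.bag, self.t_l = n, bag, t_l
        # reg[(i, p)] plays the role of R_i[p]; i is a 0-based index into bag.
        self.reg: Dict[Tuple[int, int], Value] = {
            (i, p): (0, u0) for i, s in enumerate(bag) for p in s
        }
        self.crashed: Set[int] = set()

    def _quorum(self, responders: Optional[Iterable[int]]) -> List[int]:
        """Pick the n - t_L processes whose acknowledgements are awaited."""
        need = self.n - self.t_l
        if responders is None:
            live = [p for p in range(1, self.n + 1) if p not in self.crashed]
            responders = live[:need]
        quorum = list(dict.fromkeys(responders))  # drop duplicates, keep order
        if len(quorum) != need or self.crashed & set(quorum):
            raise RuntimeError("need n - t_L distinct non-crashed responders")
        return quorum

    def _on_write(self, p: int, value: Value) -> None:
        """Lines 4-8: p updates each register it owns if the value is newer."""
        for i, s in enumerate(self.bag):
            if p in s and value[0] > self.reg[(i, p)][0]:
                self.reg[(i, p)] = value

    def _on_read(self, p: int) -> Value:
        """Line 17: the freshest tuple among all registers that p can read."""
        visible = [self.reg[(i, q)]
                   for i, s in enumerate(self.bag) if p in s
                   for q in s]
        return max(visible, key=lambda v: v[0])

    def write(self, value: Value, responders: Optional[Iterable[int]] = None) -> None:
        """Lines 1-3, with the responders standing in for the ACK-W senders."""
        for p in self._quorum(responders):
            self._on_write(p, value)

    def read(self, responders: Optional[Iterable[int]] = None) -> Value:
        """Lines 10-15, including the write-back of line 14."""
        quorum = self._quorum(responders)
        best = max((self._on_read(p) for p in quorum), key=lambda v: v[0])
        self.write(best, quorum)
        return best
```

With n = 5, the bag `[{1, 2}, {4, 5}, {2, 3, 4}]` and tL = 3, a write acknowledged only by processes 1 and 5 is still seen by a read answered only by processes 3 and 4: process 5 stores the value in its register shared within {p4, p5}, which process 4 reads. This is the "intersection at a shared register" that the definition of tL guarantees.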
We now show that the procedure Write(), called by the writer w, and the procedure
Read(), called by any process q in SL, implement an atomic SWMR register R. To do so,
we show that the calls of Write() by w and of Read() by any process satisfy Properties 1
and 2 of atomic SWMR registers given in Section 2.1.
▶ Definition 9. The operation write(v) is the execution of Write(v) by the writer w for
some tuple v = 〈snw, u〉: this operation starts when w calls Write(v) and it completes if
and when this call returns. An operation read(v) is an execution of Read() that returns v to
some process q: this operation starts when q calls Read() and it completes when this call
returns v to q.
Let v0 = 〈0, u0〉 be the initial value of the implemented register R, and, for k ≥ 1, let
vk = 〈k,−〉 denote the k-th value written by the writer w on R. Note that all vk’s are
distinct: for all i, j ≥ 0 with i ≠ j, vi ≠ vj.
Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets
of Π = {p1, p2, . . . pn}. To prove the correctness of the SWMR implementation shown in
Algorithm 1, we now consider an arbitrary execution of this implementation in SL under the
assumption that at most tL processes crash.
▶ Lemma 10. Any read(−) or write(−) operation executed by a process that does not crash
completes.
Proof. The only statements that could prevent the completion of a read(−) or write(−)
operation are the wait statements of line 2 and line 12. But since communication links are
reliable, these wait statements are for n − tL acknowledgements, and at most tL processes out
of the n processes of SL may crash, it is clear that these wait statements cannot block. ◀
We first note that every read operation returns some vk for k ≥ 0.
▶ Lemma 11. If r is a read(v) operation in the execution, then v = vk for some k ≥ 0.
Proof. This proof is straightforward and omitted here. ◀
The next lemma says that no read operation can read a “future” value, i.e., a value that
is written after the read completes.
▶ Lemma 12. If r is a read(v) operation in the execution, then either v = v0, or v = vk
such that the operation write(vk) precedes r or is concurrent with r.
Proof. This proof is straightforward and omitted here. ◀
Note that the guard in lines 6-8 (which is the only place where the shared SWMR registers
are updated) ensures that the content of each shared SWMR register in SL is non-decreasing
in the following sense:
▶ Observation 13. [Register monotonicity] For all 1 ≤ i ≤ m and all p ∈ Si, if Ri[p] = 〈k,−〉
at some time t and Ri[p] = 〈k′,−〉 at some time t′ ≥ t then k′ ≥ k.
▶ Lemma 14. For all k ≥ 1, if a call to the procedure Write(vk) returns before a read(v)
operation starts, then v = vℓ for some ℓ ≥ k.
Proof. Suppose a call to Write(vk) returns before a read(v) operation starts; we must
show that v = vℓ with ℓ ≥ k. Note that before this call of Write(vk) returns, 〈ACK-W, k〉
messages are received from a set P of n − tL distinct processes (see line 2 of the Write()
procedure). From lines 5-8, which are executed before these 〈ACK-W, k〉 messages are sent,
and by Observation 13, it is now clear that the following holds:
▶ Claim 14.1. By the time Write(vk) returns, every shared SWMR register in SL that can
be written by a process in P contains a tuple 〈k′,−〉 with k′ ≥ k.
Now consider the read(v) operation, say it is by process q. Recall that read(v) is an
execution of the Read() procedure that returns v to q. When q calls Read(), it increments
a local counter snr and asks every process p in SL to do the following: (a) read every SWMR
register that p can read, and (b) reply to q with a 〈ACK-R, snr, 〈r_sn, r_u〉〉 message such
that 〈r_sn, r_u〉 is the tuple with the maximum r_sn that p read. By line 12 of the Read()
procedure, q waits to receive such 〈ACK-R, snr, 〈−,−〉〉 messages from a set P′ of n − tL
distinct processes, and q uses these messages to select the value v as follows:
v ← max{〈r_sn, r_u〉 | q received some 〈ACK-R, snr, 〈r_sn, r_u〉〉 from a process in P′}
Thus, by Lemma 11, it is clear that:
▶ Claim 14.2. v = vℓ where ℓ = max{j | q received a 〈ACK-R, snr, 〈j,−〉〉 message from a
process in P′}.
▶ Claim 14.3. Some set Si in L contains both a process in P and a process in P′.
Proof of Claim 14.3. If P and P′ are disjoint, the claim follows directly from Definition 7
of tL. If P and P′ intersect, let p be a process in both P and P′. By Assumption 5, p is in
some set Si in L, and the claim follows. ◀
By the above claim, some set Si in L contains a process p in P and a process p′ in P′.
Since p ∈ Si and p′ ∈ Si, Ri[p] is one of the SWMR registers that can be written by p and
read by p′. From Claim 14.1, by the time the call to Write(vk) returns, Ri[p] contains
a tuple 〈k′,−〉 such that k′ ≥ k (*). Since p′ ∈ P′, during the execution of read(v) by q,
p′ reads all the shared SWMR registers that it can read, including Ri[p]. Since read(v)
starts after Write(vk) returns, p′ reads Ri[p] after Write(vk) returns. Thus, by (*) and
the monotonicity of Ri[p] (Observation 13), p′ reads from Ri[p] a tuple 〈r_sn,−〉 with
r_sn ≥ k′ ≥ k. Then p′ selects the tuple 〈j,−〉 with the maximum sn among all the 〈sn,−〉
tuples that it read (see line 17); note that j ≥ k. So the 〈ACK-R, snr, 〈j,−〉〉 message that
p′ sends to q, and q uses to select v, is such that j ≥ k. So, by Claim 14.2, v = vℓ such that
ℓ ≥ j ≥ k. ◀
Lemma 14 immediately implies the following:
▶ Corollary 15. For all k ≥ 1, if a write(vk) operation precedes a read(v) operation then
v = vℓ with ℓ ≥ k.
We now show that Algorithm 1 satisfies Properties 1 and 2 of atomic SWMR registers.
▶ Lemma 16. The write(−) and read(−) operations satisfy Property 1.
Proof. Suppose for contradiction that Property 1 does not hold. Thus there is a read
operation r = read(v) such that:
(a) there is no write(v) operation that immediately precedes r or is concurrent with r, and
(b) some write(−) operation precedes r, or v ≠ v0.
There are two cases.
1. v = v0. By (b) above, some write(−) operation, say write(vk), precedes r. Thus write(vk)
   precedes read(v0). Since k ≥ 1 this contradicts Corollary 15.
2. v ≠ v0. By Lemma 12, v = vk such that the operation write(vk) precedes r, or write(vk)
   is concurrent with r. By (a) above, write(vk) does not immediately precede r, and
   write(vk) is not concurrent with r. Thus, write(vk) precedes, but not immediately, r.
   Let write(vk′) be the write operation that immediately precedes r. Note that write(vk)
   precedes write(vk′), so k < k′. Since write(vk′) precedes r = read(v), by Corollary 15,
   v = vℓ with ℓ ≥ k′, so ℓ > k. This contradicts that v = vk.
Since both cases lead to a contradiction, Property 1 holds. ◀
▶ Lemma 17. The write(−) and read(−) operations satisfy Property 2.
Proof. We have to show that if a read(vk) operation precedes a read(vk′) operation then
k ≤ k′. Suppose read(vk) precedes read(vk′). Note that during the read(vk) operation, namely
in line 14, there is a call to the procedure Write(vk) which returns before the read(vk)
operation completes. So this call to Write(vk) returns before the read(vk′) operation starts.
If k = 0 then k ≤ k′ holds trivially; otherwise, by Lemma 14, k ≤ k′. ◀
Lemmas 10, 16 and 17 immediately imply:
▶ Theorem 18. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm}
of subsets of Π = {p1, p2, . . . pn}. If at most tL processes crash in SL, for every process w
in SL, Algorithm 1 implements an atomic SWMR register writable by w and readable by all
processes in SL.
3.2 Lower bound
▶ Theorem 19. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of
subsets of Π = {p1, p2, . . . pn}. If more than tL processes crash in SL, for some process w
in SL, there is no algorithm that implements an atomic SWMR register writable by w and
readable by all processes in SL.
Proof. Let SL be the general m&m system induced by a bag L = {S1, . . . , Sm} of subsets of
Π = {p1, p2, . . . pn}. Suppose for contradiction that t > tL processes can crash in SL, but for
every process w in SL, there is an algorithm Aw that implements an atomic SWMR register
writable by w and readable by all processes in SL (*).
Since t > tL, by Definition 7 of tL there are two disjoint subsets P and P′ of Π, of size
n − t each, such that: no set Si in L contains both a process in P and a process in P′ (**).
Since P and P′ are disjoint, the sets P, P′, and Q = Π − (P ∪ P′) form a partition of Π.
Let the writer w be any process in P . Let A be an algorithm that tolerates t > tL
process crashes in SL and implements an atomic SWMR register R that is writable by w
and readable by all processes in SL; this algorithm exists by our initial assumption (*).
Since |P ∪ Q ∪ P′| = n, clearly |P ∪ Q| = |P′ ∪ Q| = n − (n − t) = t. Since algorithm A
tolerates t crashes, it works correctly in every execution in which all the processes in P ∪ Q
or in P′ ∪ Q crash.
Figure 3 Scenarios for Theorem 19.
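The partition P, P′, Q used in this proof can be exhibited explicitly for small systems. The sketch below is an illustration only: it searches by brute force for two disjoint sets of size n − t that no set of the bag connects, under the assumptions that processes are the integers 1, . . . , n and that the bag is a list of Python sets; `separated_pair` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple


def separated_pair(n: int, bag: List[Set[int]], t: int) -> Optional[Tuple[Set[int], Set[int]]]:
    """Return disjoint P, P' of size n - t such that no S_i meets both, or None."""
    procs = range(1, n + 1)
    for P in combinations(procs, n - t):
        rest = [p for p in procs if p not in P]
        for P2 in combinations(rest, n - t):
            if not any(s & set(P) and s & set(P2) for s in bag):
                return set(P), set(P2)
    return None
```

For the example bag of Section 3, where tL = 3, calling `separated_pair(5, [{1, 2}, {4, 5}, {2, 3, 4}], 4)` returns P = {1} and P′ = {3}; then Q = {2, 4, 5} and |P ∪ Q| = |P′ ∪ Q| = 4 = t, as in the counting argument above. For t = 3 no such pair exists and the function returns `None`.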
Execution E1 of algorithm A is defined as follows:
- The processes in P′ ∪ Q crash from the beginning of the execution; they take no steps in E1.
- At some time tsw the writer w starts an operation to write the value v into the implemented
  register R, for some v ≠ v0, where v0 is the initial value of R. Since the number of
  processes that crash in E1 is |P′ ∪ Q| = t, and the algorithm A tolerates t crashes, this
  write operation eventually terminates, say at time tew.
- After this write terminates, no process takes a step up to and including some time tsr > tew.
Note that in E1, processes in P are the only ones that take steps up to time tsr.
Execution E2 of algorithm A is defined as follows:
- The processes in P ∪ Q crash from the beginning of the execution; they take no steps in E2.
- At time tsr, some process r ∈ P′ starts a read operation on the implemented register R.
  Since the number of processes that crash in E2 is |P ∪ Q| = t, and the algorithm A
  tolerates t crashes, this read operation terminates, say at time ter.
Since no write operation precedes the read operation in E2, Property 1 of atomic SWMR
registers implies:
▶ Claim 19.1. At time ter in E2 the read operation returns the initial value v0 of R.
We now construct an execution E3 of the algorithm A that merges E1 and E2, and
contradicts the atomicity of the implemented R. E3 is identical to E1 up to time tsr, and it
is identical to E2 from time tsr to ter (note that in E3 processes in Q can only take steps after
time ter). To obtain this merged run E3, intuitively we delay the messages sent by processes
in P to processes in P ′ to after time ter, and we also use the fact that processes in P ′ cannot
read any of the shared registers in SL that processes in P may have written by time tsr (this
is because of (**)).
▶ Claim 19.2. There is an execution E3 of algorithm A such that
(a) up to and including time tew, E3 is indistinguishable from E1 to all processes.
(b) up to and including time ter, E3 is indistinguishable from E2 to all processes in P ′.
(c) No process crashes in E3.
Proof of Claim 19.2. Until time tsr, E3 is identical to E1. We now show that it is possible
to extend E3 in the time interval [tsr, ter] with the sequence of steps that the processes in P′
executed during the same time interval in E2.⁴ More precisely, let s1, s2, . . . , sℓ be the
sequence of steps executed during the time interval [tsr, ter] in E2. Since only processes in P′
take steps in E2, s1, s2, . . . , sℓ are all steps of processes in P′. Let C_2^0 be the configuration of
the system SL at time tsr in E2,⁵ and let C_2^i be the configuration that results from applying
step si to configuration C_2^{i−1}, for all i such that 1 ≤ i ≤ ℓ. We will prove that there are
configurations C_3^0, C_3^1, . . . , C_3^ℓ of SL extending E3 at time tsr such that:
(i) every process in P′ has the same state in C_3^i as in C_2^i;
(ii) the set of messages sent by processes in P′ to processes in P′, but not yet received, is
the same in C_3^i as in C_2^i;
(iii) every shared register readable by processes in P′ has the same value in C_3^i as in C_2^i; and
(iv) if i ≠ 0, C_3^i is the result of applying step si to configuration C_3^{i−1}.
This is shown by induction on i.
For the basis of the induction, i = 0, we take C_3^0 to be the configuration of the system just
before time tsr in E3. Since no process in P′ takes a step before time tsr in either E2 or E3,
C_3^0 satisfies properties (i) and (ii).
▶ Claim 19.3. At time tsr in E3 the shared registers that can be read by processes in P′ have
their initial values.
Proof of Claim 19.3. Suppose, for contradiction, that at time tsr in E3, some shared register
R that can be read by a process p′ in P ′ does not have its initial value. By construction, E3
is identical to E1 until time tsr, and so only processes in P take steps before time tsr in E3.
Thus, register R was written by some process p in P by time tsr in E3. Since R is readable
by p′ ∈ P ′ and is written by p ∈ P , R is shared by both p and p′. Thus, there must be a set
Si in L that contains both p and p′ – a contradiction to (**). ∎
By Claim 19.3, the shared registers readable by processes in P′ have the same value
(namely, their initial value) in C_3^0 as in C_2^0. So, C_3^0 also satisfies property (iii). Property (iv)
is vacuously true for i = 0.
⁴ A step of A executed by process p is one of the following: p sending or receiving a message, or p applying
a write or a read operation to a shared register in SL.
⁵ The configuration of SL at time t in execution E consists of the state of each process, the set of messages
sent but not yet received, and the value of each shared register in SL at time t in E.
For the induction step, for each i such that 1 ≤ i ≤ ℓ, we consider separately the cases
of si being a step to send a message, receive a message, write a shared register, and read
a shared register. In each case, it is easy to verify that, assuming (inductively) that C_3^{i−1}
has properties (i)–(iv), step si is applicable to C_3^{i−1}, and the resulting configuration C_3^i has
properties (i)–(iv). For instance, for a receive step, property (ii) ensures that the message
received in E2 is also available in E3; for a read step, property (iii) ensures that the same
value is returned.
To complete the definition of E3, after time ter we let processes take steps in round-robin
fashion. Whenever a process’s step is to receive a message, it receives the oldest one sent to
it; this ensures that all messages are eventually received. Processes continue taking steps in
this fashion according to algorithm A.
Since E3 is identical to E1 up to and including time tew, E3 is indistinguishable from E1
up to and including time tew to all processes in P . This proves part (a) of the claim.
Note that in E3 and E2, the processes in P ′: (a) take no steps before time tsr, and (b)
during the time interval [tsr, ter], they execute exactly the same sequence of steps, and go
through the same sequence of states. Thus, up to and including time ter, E3 is indistinguishable
from E2 to all processes in P ′. This proves part (b) of the claim.
Finally, every process takes steps as required by the algorithm in E3, so no process crashes.
This proves part (c) of the claim. ∎
By Claim 19.2(a), up to and including time tew, E3 is indistinguishable from E1 to
the writer w ∈ P. So E3 contains the write operation that writes v ≠ v0 into R, which
starts at time tsw and ends at time tew. By Claim 19.2(b), up to and including time ter, E3 is
indistinguishable from E2 to r ∈ P ′. So E3 contains the read operation that returns v0, which
starts at time tsr and ends at time ter. Since tew < tsr, this read operation violates Property 1
of atomic SWMR registers. As there are no process crashes in E3 (by Claim 19.2(c)), this
contradicts the assumption that A is an implementation of an atomic SWMR register R
that tolerates t > tL crashes. ∎
Note that the proof of Theorem 19 does not depend on the type or number of registers
shared by the processes in each set Si of the bag L. So the result of this theorem applies not
only to SL but also to every m&m system in ML. In fact, the proof of Theorem 19 does
not even depend on the type of objects that are shared by the processes in each set Si; for
example these objects could include queues, stacks, and consensus objects. Hence we have
the following stronger result:
Theorem 20. Consider any m&m system S induced by a bag L = {S1, ..., Sm} of subsets
of Π = {p1, p2, ..., pn}, where the processes in each Si share any number of arbitrary objects
among themselves (and only among themselves). If more than tL processes crash in S, then
for some process w in S, there is no algorithm that implements an atomic SWMR register
writable by w and readable by all processes in S.
4 Atomic SWMR register implementation in uniform m&m systems
Let G = (V, E) be an undirected graph where V = {p1, p2, ..., pn}, i.e., the nodes of G are
the processes p1, p2, ..., pn. Let SG be the uniform m&m system induced by G. Recall that
in SG, each process pi and its neighbours in G share some atomic SWMR registers that can
be read by (and only by) them.
We now use G to determine the maximum number of process crashes that may occur in SG
such that it is possible to implement a shared atomic SWMR register readable by all processes
in SG. To do so, we first recall the definition of the square of the graph G: G² = (V, E′)
where E′ = {(u, v) | (u, v) ∈ E or ∃k ∈ V such that (u, k) ∈ E and (k, v) ∈ E}.
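As a concrete illustration of this definition, the following Python sketch computes the edge set of the square of a graph from an edge list. The function name `square` and the four-node path used as input are hypothetical, chosen only for this example; self-loops that the definition formally admits are ignored here, since they can never connect two disjoint sets of nodes.

```python
from itertools import combinations


def square(nodes, edges):
    """Return the edges of the square of G as a set of frozensets {u, v}, u != v."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sq = set()
    for u, v in combinations(nodes, 2):
        # (u, v) is in E' if it is an edge of G, or if u and v share a neighbour k.
        if v in adj[u] or adj[u] & adj[v]:
            sq.add(frozenset((u, v)))
    return sq


# Hypothetical input: the path a - b - c - d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
g2 = square(nodes, edges)
print(sorted(tuple(sorted(e)) for e in g2))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

In this small input the square adds the two-hop pairs (a, c) and (b, d), while a and d, which are three hops apart, remain unconnected.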
Figure 6 The Hoffman-Singleton graph.
Definition 21. Given an undirected graph G = (V, E) such that V = {p1, p2, ..., pn}, tG is
the maximum integer t such that the following condition holds: For all disjoint subsets P
and P′ of V of size n − t each, some edge in G² connects a node in P with a node in P′;
i.e., G² has an edge (u, v) such that u ∈ P and v ∈ P′.
Note that tG ≥ ⌊(n − 1)/2⌋. Moreover, in a pure message-passing system (where G and G²
have no edges) tG = ⌊(n − 1)/2⌋.
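The lower bound can be read off Definition 21 directly. The short derivation below is a sketch of that reasoning, where t denotes a candidate integer value in the definition:

```latex
\[
  t \le \left\lfloor \frac{n-1}{2} \right\rfloor
  \iff 2t \le n-1
  \iff 2(n-t) \ge n+1 > n .
\]
```

So for every such t, two disjoint subsets of V of size n − t each cannot exist, and the condition of Definition 21 holds vacuously. Conversely, for t = ⌊(n − 1)/2⌋ + 1 and n ≥ 2, disjoint subsets of size n − t do exist, and a graph without edges cannot connect them, which gives the equality in the pure message-passing case.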
In Theorem 22 stated below, we prove that in the uniform m&m system SG induced by a
graph G, it is possible to implement an atomic SWMR register readable by all processes
if and only if at most tG processes may crash in SG.
For example, consider the graph G in Figure 4 where V = {p1, p2, p3, p4, p5}. Figure 5
shows the corresponding G² graph. By the above definition of tG: (a) tG ≥ 3 because for any
two disjoint subsets of V of size 5 − 3 = 2 each, G² has an edge that “connects” these two
subsets (e.g., for subsets P = {p1, p2} and P′ = {p4, p5}, the edge (p2, p4) of G² connects a
node of P to a node of P′), and (b) tG < 4 because there are two disjoint subsets {p1}, {p5}
of size 5 − 4 = 1 each, such that no edge in G² connects p1 and p5. So in this example n = 5
and tG = 3 > ⌊(5 − 1)/2⌋ = 2.
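The value of tG in a small example like this one can be checked by brute force over all pairs of disjoint subsets. The Python sketch below does so under a loudly stated assumption: Figure 4 is not reproduced here, so the code uses the path p1 – p2 – p3 – p4 – p5 (nodes 0 to 4) as a stand-in. That choice is consistent with the facts stated in the text (four edges, an edge (p2, p4) in the square, no edge between p1 and p5 in the square), but it is only an illustration. The helper names `square_adj`, `holds`, and `t_G` are hypothetical.

```python
from itertools import combinations


def square_adj(n, edges):
    """Adjacency sets of the square of G, for nodes 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sq = [set(a) for a in adj]
    for k in range(n):
        for u in adj[k]:
            sq[u] |= adj[k] - {u}  # all other neighbours of k are two hops from u
    return sq


def holds(n, sq, t):
    """Condition of Definition 21 for a given t (vacuously true if no disjoint pair exists)."""
    size = n - t
    for P in combinations(range(n), size):
        rest = [v for v in range(n) if v not in P]
        for Q in combinations(rest, size):
            if not any(sq[u] & set(Q) for u in P):
                return False
    return True


def t_G(n, edges):
    sq = square_adj(n, edges)
    return max(t for t in range(n) if holds(n, sq, t))


path = [(0, 1), (1, 2), (2, 3), (3, 4)]  # assumed stand-in for the graph of Figure 4
print(t_G(5, path))  # 3
print(t_G(5, []))    # 2, the pure message-passing case
```

The search is exponential in n, so it is only meant for toy graphs of this size.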
Now consider the uniform m&m system SG of 5 processes induced by this graph G. In
addition to message-passing links, SG has 4 pairwise RDMA connections. Since tG = 3, by
Theorem 22: (1) we can implement an atomic SWMR register readable by all 5 processes
of SG even if 3 of them (i.e., more than the majority) may crash, and (2) no algorithm can
implement such a register in SG if more than 3 processes may crash.
As another example, consider a pure message-passing system S with 50 nodes. In S, one
can implement an atomic SWMR register R (readable by all the processes) only if at most 24
process crashes may occur. But if we allow each process of S to establish 7 pairwise RDMA
connections, one can implement R in a way that tolerates any number of process crashes
(i.e., R is wait-free). This is because there is an undirected graph G with n = 50 nodes, each
with degree 7, such that G² has an edge between every pair of nodes (G is the well-known
Hoffman-Singleton graph [11] shown in Figure 6 [25]); so G has tG = n − 1 = 49, and thus
by Theorem 22 it is possible to implement R in the uniform m&m system SG in a way that
tolerates up to 49 process crashes. Some simple graph theory arguments show that this is
optimal in two ways: (a) one cannot implement a wait-free register R that is shared by 50
processes with fewer than 7 RDMA connections per process (more precisely, with any such
implementation, if a process has fewer than 7 RDMA connections there must be another
process with more than 7 RDMA connections), and (b) with at most 7 RDMA connections
per process, one cannot implement a wait-free register R that is shared by more than 50
processes.
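The properties of the Hoffman-Singleton graph used above can be checked mechanically. The Python sketch below builds the graph with a well-known pentagon/pentagram construction (commonly attributed to Robertson, and not taken from this paper): five pentagons P_h, five pentagrams Q_i, and an edge from vertex j of P_h to vertex (h·i + j) mod 5 of Q_i. It then checks that there are 50 nodes, all of degree 7, and that every pair of nodes is adjacent or shares a neighbour. The function `moore_bound` illustrates the standard counting argument behind the optimality remarks: a node of degree at most d has at most d + d(d − 1) = d² other nodes within distance two, so the square can be complete only if n ≤ d² + 1. All names in the code are hypothetical.

```python
from itertools import combinations


def hoffman_singleton():
    """Pentagon/pentagram construction; vertices are ('P', h, j) and ('Q', i, j)."""
    adj = {}

    def add(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for h in range(5):
        for j in range(5):
            add(("P", h, j), ("P", h, (j + 1) % 5))  # pentagon P_h
            add(("Q", h, j), ("Q", h, (j + 2) % 5))  # pentagram Q_h
    for h in range(5):
        for i in range(5):
            for j in range(5):
                add(("P", h, j), ("Q", i, (h * i + j) % 5))
    return adj


def moore_bound(d):
    """Largest n for which a graph of maximum degree d can have a complete square."""
    return d * d + 1


adj = hoffman_singleton()
n = len(adj)
degrees = {len(nb) for nb in adj.values()}
square_is_complete = all(
    v in adj[u] or adj[u] & adj[v] for u, v in combinations(adj, 2)
)
print(n, degrees, square_is_complete)  # 50 {7} True
print(moore_bound(6), moore_bound(7))  # 37 50
```

Since the square is complete, any two disjoint non-empty subsets are connected by one of its edges, which is exactly the condition of Definition 21 for t = n − 1 = 49.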
Theorem 22. Let SG be the uniform m&m system induced by an undirected graph
G = (V, E) where V = {p1, p2, ..., pn}.
If at most tG processes crash in SG, then for every process w in SG, it is possible to
implement an atomic SWMR register writable by w and readable by all processes in SG.
If more than tG processes crash in SG, then for some process w in SG, it is impossible to
implement an atomic SWMR register writable by w and readable by all processes in SG.
Proof. By Definition 6, SG is the m&m system SL where L = {S1, S2, . . . , Sn} such that
Si = N+(pi), i.e., for all i, 1 ≤ i ≤ n, Si is the set of neighbours of pi (including pi) in the
graph G. Recall that tL is the maximum t such that for all disjoint subsets P and P ′ of V
of size n− t each, some set Si in L contains both a node in P and a node in P ′.
Claim 22.1. tG = tL.
Proof of Claim 22.1. From the definitions of tG and tL, it is clear that to prove the claim
it suffices to show that for all disjoint subsets P and P′ of V of size n − t each, the following
holds: some edge in G² connects a node in P with a node in P′ if and only if some set Si in
L contains both a node in P and a node in P′.
[Only If] Suppose G² has an edge (pi, pj) such that pi ∈ P and pj ∈ P′; since P and P′ are
disjoint, pi and pj are distinct. By definition of G², there are two cases:
1. (pi, pj) ∈ E. In this case, pj ∈ N+(pi) and pi ∈ N+(pi). So the set Si = N+(pi) in L
contains both node pi ∈ P and node pj ∈ P ′.
2. There is a node pk ∈ V such that (pi, pk) ∈ E and (pk, pj) ∈ E. In this case, pi ∈ N+(pk)
and pj ∈ N+(pk). So the set Sk = N+(pk) in L contains both pi ∈ P and pj ∈ P ′.
So in both cases, some set Sℓ in L contains both a node in P and a node in P′.
[If] Suppose set Sk in L contains both a node pi in P and a node pj in P ′; since P and P ′
are disjoint, pi and pj are distinct. Recall that Sk = N+(pk) for node pk ∈ V .
There are two cases:
1. pi, pj and pk are distinct. In this case, since pi and pj are in Sk = N+(pk), (pi, pk) and
(pk, pj) are edges of G, i.e., (pi, pk) ∈ E and (pk, pj) ∈ E. Thus, by definition of G²,
(pi, pj) is an edge of G²; this edge connects pi ∈ P and pj ∈ P′.
2. pk = pi or pk = pj. Without loss of generality, assume that pk = pi. Since pi and pj are
in N+(pk) = N+(pi), (pi, pj) must be an edge of G, i.e., (pi, pj) ∈ E. Thus, by definition
of G², (pi, pj) is an edge of G²; this edge connects pi ∈ P and pj ∈ P′.
So in both cases, some edge in G² connects a node in P with a node in P′. ∎
The result now follows immediately from Claim 22.1 and Theorem 8. ∎
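The equivalence at the heart of Claim 22.1 reduces to a statement about pairs of distinct nodes: two nodes are joined by an edge of the square of G if and only if some closed neighbourhood N+(pk) contains both. The Python sketch below checks this pairwise statement on randomly generated small graphs. It is a sanity check under assumed parameters (graph sizes, edge probability, and the seed are arbitrary choices), not a substitute for the proof; the helper names are hypothetical.

```python
import random
from itertools import combinations


def closed_neighbourhoods(n, edges):
    """The bag L = {S_0, ..., S_{n-1}} with S_i = N+(i), for nodes 0..n-1."""
    L = [{i} for i in range(n)]
    for u, v in edges:
        L[u].add(v)
        L[v].add(u)
    return L


def claim_holds(n, edges):
    L = closed_neighbourhoods(n, edges)
    adj = [S - {i} for i, S in enumerate(L)]  # open neighbourhoods
    for u, v in combinations(range(n), 2):
        in_square = v in adj[u] or bool(adj[u] & adj[v])
        in_some_set = any(u in S and v in S for S in L)
        if in_square != in_some_set:
            return False
    return True


rng = random.Random(0)  # fixed seed so that the run is reproducible
results = []
for _ in range(200):
    n = rng.randint(1, 7)
    edges = [(u, v) for u, v in combinations(range(n), 2) if rng.random() < 0.3]
    results.append(claim_holds(n, edges))
print(all(results))  # True
```

Because both conditions in the claim quantify over a pair consisting of one node of P and one node of P′, agreement on every pair of distinct nodes implies agreement for every pair of disjoint subsets.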
5 Concluding remarks
Hybrid systems that combine message passing and shared memory have long been a subject
of study in the systems community [3, 5, 6, 7, 16, 17, 20, 23]. To the best of our knowledge,
however, such systems have only recently been examined from a theoretical point of view.
Aguilera et al. gave a rigorous model for hybrid systems, and studied how the combination
of message passing and shared memory can be harnessed to improve solutions to certain
fundamental problems. In particular, they showed that, compared to a pure message-passing
system, a hybrid system can improve the fault tolerance of randomized consensus algorithms
and reduce the synchrony necessary to elect a leader [1]. A more recent paper by Aguilera et
al. extends the hybrid model to Byzantine failures, and shows how to improve the inherent
trade-off between fault tolerance and performance for consensus, for both Byzantine and
crash failures [2]. The present paper is another contribution to the theoretical study of hybrid
systems: whereas the highly cited paper by Attiya et al. shows how to implement an atomic
SWMR register with optimal fault tolerance in a pure message-passing system [4], here we
solve the corresponding problem in hybrid systems. Extending our results to hybrid systems
with Byzantine failures is a subject for future research.
References
1 Marcos K. Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and
Sam Toueg. Passing Messages while Sharing Memory. In Proceedings of the 2018 ACM
Symposium on Principles of Distributed Computing, PODC 2018, Egham, United Kingdom,
July 23-27, 2018, pages 51–60, 2018. doi:10.1145/3212734.3212741.
2 Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, and Igor
Zablotchi. The Impact of RDMA on Agreement. In Proceedings of the 2019 ACM Symposium
on Principles of Distributed Computing, PODC 2019, Toronto, ON, Canada, July 29 - August
2, 2019, pages 409–418, 2019. doi:10.1145/3293611.3331601.
3 Cristiana Amza, Alan L. Cox, Shandya Dwarkadas, Pete Keleher, Honghui Lu, Ramakrishnan
Rajamony, Weimin Yu, and Willy Zwaenepoel. TreadMarks: Shared memory computing on
networks of workstations. IEEE Computer, 29(2):18–28, February 1996.
4 Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing
systems. Journal of the ACM, 42(1):124–142, January 1995.
5 Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon
Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New
OS Architecture for Scalable Multicore Systems. In ACM Symposium on Operating Systems
Principles, pages 29–44, October 2009.
6 John K. Bennett, John B. Carter, and Willy Zwaenepoel. Munin: Distributed Shared Memory
Based on Type-specific Memory Coherence. In ACM Symposium on Principles and Practice
of Parallel Programming, pages 168–176, March 1990.
7 Tudor David, Rachid Guerraoui, and Maysam Yabandeh. Consensus Inside. In International
Middleware Conference, pages 145–156, December 2014.
8 Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM:
Fast remote memory. In Symposium on Networked Systems Design and Implementation, pages
401–414, April 2014.
9 Gen-Z Draft Core Specification—December 2016. URL: http://genzconsortium.org/
draft-core-specification-december-2016.
10 Gen-Z DRAM and Persistent Memory Theory of Operation. URL:
https://genzconsortium.org/wp-content/uploads/2019/03/Gen-Z-DRAM-PM-Theory-of-Operation-WP.pdf.
11 Alan J. Hoffman and Robert R. Singleton. On Moore Graphs with Diameters 2 and 3. IBM
Journal of Research and Development, 4(5):497–504, 1960. doi:10.1147/rd.45.0497.
12 InfiniBand. http://www.infinibandta.org/content/pages.php?pg=about_us_infiniband.
13 iWARP. https://en.wikipedia.org/wiki/IWARP.
14 Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA Efficiently for Key-value
Services. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and
Protocols for Computer Communications, pages 295–306, August 2014.
15 Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple
distributed transactions with two-sided (RDMA) datagram RPCs. In Symposium on Operating
Systems Design and Implementation, pages 185–201, November 2016.
16 Stefanos Kaxiras, David Klaftenegger, Magnus Norgren, Alberto Ros, and Konstantinos
Sagonas. Turning Centralized Coherence and Distributed Critical-Section Execution on Their
Head: A New Approach for Scalable Distributed Shared Memory. In IEEE International
Symposium on High Performance Distributed Computing, pages 3–14, June 2015.
17 David Kranz, Kirk Johnson, Anant Agarwal, John Kubiatowicz, and Beng-Hong Lim. Integrating
Message-passing and Shared-memory: Early Experience. In ACM Symposium on
Principles and Practice of Parallel Programming, pages 54–63, 1993.
18 Leslie Lamport. On interprocess communication Part I–II. Distributed Computing, 1(2):77–101,
May 1986.
19 Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt,
and Thomas F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers.
In International Symposium on Computer Architecture, pages 267–278, June 2009.
20 Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and
Mark Oskin. Latency-tolerant Software Distributed Shared Memory. In USENIX Annual
Technical Conference, pages 291–305, July 2015.
21 Marius Poke and Torsten Hoefler. DARE: High-Performance State Machine Replication on
RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance
Parallel and Distributed Computing, HPDC ’15, pages 107–118, New York, NY, USA, 2015.
ACM. doi:10.1145/2749246.2749267.
22 RDMA over converged ethernet. https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet.
23 Daniel J. Scales, Kourosh Gharachorloo, and Chandramohan A. Thekkath. Shasta: A Low
Overhead, Software-only Approach for Supporting Fine-grain Shared Memory. In International
Conference on Architectural Support for Programming Languages and Operating Systems, pages
174–185, October 1996.
24 Shin-Yeh Tsai and Yiying Zhang. LITE kernel RDMA support for datacenter applications. In
ACM Symposium on Operating Systems Principles, pages 306–324, October 2017.
25 Figure by Uzyel - Own work, CC BY-SA 3.0.
https://commons.wikimedia.org/w/index.php?curid=10378641.
26 Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A Distributed File System for
Non-Volatile Main Memory and RDMA-Capable Networks. In 17th USENIX Conference on
File and Storage Technologies (FAST 19), pages 221–234, Boston, MA, February 2019. USENIX
Association. URL: https://www.usenix.org/conference/fast19/presentation/yang.
