Motivated by recent distributed systems technology [4, 8, 5, 7, 3], Aguilera et al. introduced a hybrid model of distributed computing, called the message-and-memory model or m&m model for short [1]. In this model processes can communicate by message passing and also by accessing some shared memory. We consider the basic problem of implementing an atomic single-writer multi-reader (SWMR) register shared by all the processes in m&m systems. Specifically, for every m&m system, we give an algorithm that implements such a register in this system and show that it is optimal in the number of process crashes that it can tolerate. This generalizes the well-known implementation of an atomic SWMR register in a pure message-passing system [2].
Introduction
Motivated by recent distributed systems technology [4, 8, 5, 7, 3], Aguilera et al. introduced a hybrid model of distributed computing, called the message-and-memory model or m&m model for short [1]. In this model processes can communicate by message passing and also by accessing some shared memory. Since it is impractical to share memory among all processes in large distributed systems, the m&m model allows us to specify which sets of processes share which sets of registers. Among other results, Aguilera et al. show that it is possible to leverage the advantages of the two communication mechanisms (message-passing and shared-memory) to improve the fault-tolerance of randomized consensus algorithms compared to a pure message-passing system.
In this paper, we consider the basic problem of implementing an atomic single-writer multi-reader (SWMR) register shared by all the processes in m&m systems. Specifically, for every m&m system, we give an algorithm that implements such a register in this system and show that it is optimal in the number of process crashes that it can tolerate. This generalizes the well-known implementation of an atomic SWMR register in a pure message-passing system [2] . We now describe our results in more detail.
A general m&m system S L is specified by a set of n asynchronous processes that can send messages to each other over asynchronous reliable links, and by a collection L = {S 1 , S 2 , . . . , S m } where each S i is a subset of processes: for each S i , there is a set of registers that can be shared by processes in S i and only by them. Even though the m&m model allows the collection L to be arbitrary, in practice hardware technology imposes a structure on L: for processes to share memory, they must establish a connection between them (e.g., an RDMA connection). These connections are naturally modelled by an undirected shared-memory graph G whose nodes are the processes and whose edges are shared-memory connections [1] . Such a graph G defines what Aguilera et al. call a uniform m&m system S G , where each process has registers that it can share with its neighbours in G (and only with them). Note that S G is just the instance of the m&m system S L with L = {S 1 , S 2 , . . . , S n } where each S i consists of a process and its neighbours in G. Furthermore, if G is the trivial graph with n nodes but no edges, the m&m system S G that G induces is just a pure message-passing system.
We consider the implementation of an atomic SWMR register $R$, shared by all the processes, in both general and uniform m&m systems. For each general m&m system $S_L$, we determine the maximum number of crashes $t_L$ for which it is possible to implement $R$ in $S_L$: we give an algorithm that tolerates $t_L$ crashes and prove that no algorithm can tolerate more than $t_L$ crashes. Similarly, for each shared-memory graph $G$ and its corresponding uniform m&m system $S_G$, we determine the maximum number of crashes $t_G$ for which it is possible to implement $R$ in $S_G$. In pure message-passing systems, it is well known that the maximum number of crashes that can be tolerated by any implementation of an atomic SWMR register is $\lfloor (n-1)/2 \rfloor$. In contrast, for general and uniform m&m systems, $t_L$ and $t_G$ can exceed $\lfloor (n-1)/2 \rfloor$. This confirms that m&m systems can provide greater fault tolerance compared to pure message-passing systems.
Model outline
Henceforth we consider m&m systems with a set of n processes Π = {p 1 , p 2 , . . . p n } that may crash. To define these systems we must first recall the definition of atomic SWMR registers.
Atomic SWMR registers
A SWMR register R shared by a set S of processes is a register that can be written (sequentially) by exactly one process w ∈ S and can be read by all processes in S; we say that w is the writer of R [6] .
To define an atomic SWMR register, we first informally define what it means for a (read or write) operation to precede another operation, and for two operations to be concurrent (precise definitions are given in [6]).

We now define atomic SWMR registers in terms of two simple properties that they must satisfy. To do so, it is convenient to assume that the values successively written by the writer $w$ of a SWMR register $R$ are distinct, and different from the initial value of $R$. Henceforth, $v_0$ denotes the initial value of $R$, and $v_k$ denotes the value written by the $k$-th write operation of $w$. A SWMR register $R$ is atomic if and only if it satisfies the following two properties:

Property 1. If a read operation $r$ returns the value $v$ then:
1. there is a write operation of $v$ that immediately precedes $r$ or is concurrent with $r$, or
2. no write operation precedes $r$ and $v = v_0$.

Property 2. If two read operations $r$ and $r'$ return values $v_k$ and $v_{k'}$, respectively, and $r$ precedes $r'$, then $k \le k'$.
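As an illustration of the two properties, the following minimal sketch checks them on a finite history of completed operations. The `Op` record, the real-valued timestamps, and the reading of "immediately precedes" (the write precedes the read and no other write lies strictly between them) are assumptions made for this sketch; they are not taken from [6].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    kind: str     # "write" or "read" (hypothetical encoding)
    k: int        # index k of the value v_k written or returned
    start: float  # invocation time
    end: float    # response time

def precedes(a, b):
    # a precedes b if a completes before b starts
    return a.end < b.start

def satisfies_property_1(ops):
    writes = [o for o in ops if o.kind == "write"]
    for r in (o for o in ops if o.kind == "read"):
        if r.k == 0:
            # returning v_0 is allowed only if no write precedes r
            ok = not any(precedes(w, r) for w in writes)
        else:
            w = next((x for x in writes if x.k == r.k), None)
            if w is None:
                ok = False
            else:
                concurrent = not precedes(w, r) and not precedes(r, w)
                immediate = precedes(w, r) and not any(
                    precedes(w, x) and precedes(x, r) for x in writes)
                ok = concurrent or immediate
        if not ok:
            return False
    return True

def satisfies_property_2(ops):
    reads = [o for o in ops if o.kind == "read"]
    return all(r1.k <= r2.k
               for r1 in reads for r2 in reads if precedes(r1, r2))
```

For instance, a history in which a read returns $v_1$ and a later, non-overlapping read returns $v_0$ fails the Property 2 check.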
General m&m systems
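Definition 4. Let $L = \{S_1, S_2, \ldots, S_m\}$ be any bag of non-empty subsets of $\Pi = \{p_1, p_2, \ldots, p_n\}$. The bag $L$ induces a class of m&m systems, each of which consists of:
1. The processes in $\Pi$.
2. Reliable asynchronous communication links between every pair of processes in $\Pi$.
3. For each subset of processes $S_i$ in $L$, a set of atomic registers that are shared by the processes in $S_i$ (and only by them).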
In this paper, we focus on a particular member of the above class of m&m systems, one in which the set of registers shared by the processes in each set S i are atomic SWMR registers. This is because we are interested in implementing atomic SWMR registers (shared by all processes in the system). More precisely, we focus on the m&m system S L defined below:
1. The processes in Π.
2. Reliable asynchronous communication links between every pair of processes in Π.
3. For each subset of processes S i in L and each process p ∈ S i , a SWMR register R i [p] that can be written by p and read by all processes in S i (and only by them).
From the above, it is clear that processes in an m&m system S L can communicate by message passing or shared memory. For convenience henceforth we assume the following:
Assumption 5. The bag L = {S 1 , S 2 , . . . , S m } of subsets of Π = {p 1 , p 2 , . . . p n } is such that every process in Π is in at least one of the subsets S j of L.
We now explain why making this assumption does not strengthen the system $S_L$ induced by $L$. Given a bag $L$ that does not satisfy the above assumption, we can construct a bag $L'$ that satisfies the assumption as follows: for every process $p_i$ in $\Pi$ that is not contained in any $S_j$ of $L$, we add the singleton set $\{p_i\}$ to $L$. Let $L'$ be the resulting bag. By Definition 4(3) above, adding $\{p_i\}$ to $L$ results in adding a local register to the induced system $S_L$, namely, an atomic register that $p_i$ (trivially) shares only with itself. So $S_{L'}$ is just $S_L$ with some additional local registers.
Note that if L is just {{p 1 }, {p 2 }, . . . {p n }} then S L is just a pure message-passing system (with no shared memory).
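The construction of a bag satisfying Assumption 5 can be sketched in a few lines. The function name `satisfy_assumption_5` and the representation of a bag as a Python list of sets are choices made for this illustration only.

```python
def satisfy_assumption_5(processes, bag):
    """Return a new bag in which every process belongs to at least one subset.

    Processes not covered by any subset of `bag` get a singleton subset,
    which corresponds to a purely local register in the induced system.
    """
    covered = set().union(*bag)
    return list(bag) + [frozenset({p}) for p in processes if p not in covered]
```

Applied to an empty bag, the sketch yields one singleton per process, which corresponds to the pure message-passing case noted above.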
Uniform m&m systems
Let $G = (V, E)$ be an undirected graph such that $V = \Pi$, i.e., the nodes of $G$ are the $n$ processes $p_1, p_2, \ldots, p_n$ of the system. For each process $p_i$, let $N^+(p_i)$ denote the set of neighbours of $p_i$ in $G$, including $p_i$ itself.

[Figure 2: $S_G$ induced by $G$]

Graph $G$ induces the uniform m&m system $S_G$ where processes can communicate by message passing (via reliable asynchronous communication links), and also by shared memory as follows: for each process $p_i$, and every neighbour $p_j$ of $p_i$ in $G$ (including $p_i$) there is a SWMR register $R_i[p_j]$ that can be written by $p_j$ and read by every neighbour of $p_i$ in $G$ (including $p_i$). Intuitively, the registers $R_i[*]$ are physically located in process $p_i$, and every neighbour of $p_i$ can access these registers over its connection to $p_i$ (shown as an edge of $G$).
For example, in Figures 1 and 2 we show a graph $G$ and the uniform m&m system $S_G$ induced by $G$. Here $G$ has five nodes, and $S_G$ is the m&m system $S_L$ with $L = \{S_1, \ldots, S_5\}$ (where each $S_i$ consists of $p_i$ and the neighbours of node $p_i$ of $G$). The edges of $G$ represent the RDMA connections that allow processes to share registers. The box adjacent to each process $p_i$ in $S_G$ represents the array of SWMR registers that are shared among $p_i$ and its neighbours (intuitively these registers are hosted by $p_i$). For example, in the box adjacent to process $p_2$, the component labelled $p_1$ represents the register $R_2[p_1]$ that can be written by $p_1$ and read by all the neighbours of $p_2$, including $p_2$.
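To make the correspondence between a graph and its bag concrete, here is a small sketch that derives the sets $S_i$ (each process together with its neighbours) and lists the registers $R_i[p_j]$. The function names and the edge-list input format are assumptions of this sketch, and the path graph mentioned after it is a made-up example, not the graph of Figure 1.

```python
def closed_neighbourhoods(nodes, edges):
    """Map each node p_i to S_i: p_i together with its neighbours in G."""
    nbrs = {p: {p} for p in nodes}
    for u, v in edges:          # undirected edge: record both directions
        nbrs[u].add(v)
        nbrs[v].add(u)
    return {p: frozenset(nbrs[p]) for p in nodes}

def uniform_registers(nodes, edges):
    """List the pairs (i, j) such that register R_i[p_j] exists in S_G.

    R_i[p_j] is hosted at p_i, written by p_j, and read by all of S_i.
    """
    hood = closed_neighbourhoods(nodes, edges)
    return sorted((i, j) for i in nodes for j in hood[i])
```

On the three-node path 1 - 2 - 3, the middle node hosts three registers while each end node hosts two.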
SWMR register implementation in general m&m systems
Let S L be the general m&m system induced by a bag L = {S 1 , . . . , S m } of subsets of Π = {p 1 , p 2 , . . . p n }. Recall that in system S L , for every S i in L, the processes in S i share some SWMR registers that can be read only by the processes in S i . In the rest of the paper, we determine the maximum number of process crashes that may occur in S L such that it is possible to implement a shared atomic SWMR register readable by all processes in S L .
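Definition 7. Given a bag $L = \{S_1, \ldots, S_m\}$ of subsets of $\Pi = \{p_1, p_2, \ldots, p_n\}$, $t_L$ is the maximum integer $t$ such that the following holds. For all subsets $P$ and $P'$ of $\Pi$ of size $n - t$ each:
1. $P$ and $P'$ intersect, or
2. some set $S_i$ in $L$ contains both a process in $P$ and a process in $P'$, i.e., $\exists i \in \{1, \ldots, m\}$, $\exists p_1 \in P$ and $\exists p_2 \in P'$ such that $\{p_1, p_2\} \subseteq S_i$.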
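For small systems, the definition of $t_L$ can be evaluated directly by exhaustive search. The sketch below does so; it is exponential in $n$ and meant only to make the definition concrete. The function names and the list-of-sets representation of the bag are assumptions of this illustration.

```python
from itertools import combinations

def condition_holds(procs, bag, t):
    """Check the two-part condition for all pairs of subsets of size n - t."""
    size = len(procs) - t
    for P in map(set, combinations(procs, size)):
        for P_prime in map(set, combinations(procs, size)):
            if P & P_prime:
                continue                      # case 1: the subsets intersect
            if any(S & P and S & P_prime for S in bag):
                continue                      # case 2: some S_i meets both
            return False
    return True

def t_L(procs, bag):
    """Largest t in 0..n for which the condition holds (brute force)."""
    return max(t for t in range(len(procs) + 1)
               if condition_holds(procs, bag, t))
```

With four processes and the bag {{1, 2}, {2, 3}, {3, 4}}, the search returns 2, whereas the all-singletons bag (pure message passing) returns 1.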
We now prove that it is possible to implement an atomic SWMR register readable by all processes in the general m&m system S L if and only if at most t L processes may crash in S L . More precisely:
Theorem 8. Let S L be the general m&m system induced by a bag L = {S 1 , . . . , S m } of subsets of Π = {p 1 , p 2 , . . . p n }.
• If at most t L processes crash in S L , for every process w in S L , it is possible to implement an atomic SWMR register writable by w and readable by all processes in S L .
• If more than t L processes crash in S L , for some process w in S L , it is impossible to implement an atomic SWMR register writable by w and readable by all processes in S L .
The above theorem is a direct corollary of Theorem 18 (in Section 3.1) and Theorem 19 (in Section 3.2).
Algorithm
We now show how to implement a SWMR register R, that can be written by an arbitrary fixed process w and read by all processes, in an m&m system S L where at most t L processes may crash. This implementation is given in terms of the procedures Write() and Read() shown in Algorithm 1.
Without loss of generality, we assume that for all $i \ge 1$, the $i$-th value that the writer writes is of the form $\langle i, val \rangle$. To write $\langle i, val \rangle$ into $R$, the writer $w$ calls the procedure Write($\langle i, val \rangle$). To read $R$, a reader $q$ calls the procedure Read() which returns a value of the form $\langle i, val \rangle$. The sequence number $i$ makes the values written to $R$ unique.
Algorithm 1 generalizes the well-known implementation of a SWMR register in pure message-passing systems [2] . To write a new value into R the writer w sends messages to all processes asking them to write the new value into all the shared SWMR registers that they can write in S L . The writer then waits for acknowledgments from n − t L processes indicating that they have done so. To read R, a process sends messages to all processes asking them for the most up-to-date value that they can find in all the shared SWMR registers that they can read. The reader waits for n − t L responses, selects the most up-to-date value among them, writes back that value (using the same procedure that the writer uses), and returns it. From the definition of t L it follows that every update of R "intersects" with every read of R at some shared SWMR register of S L . Note that since at most t L processes crash, the waiting mentioned above does not block any process.
We now show that the procedure Write(), called by the writer w, and the procedure Read(), called by any reader q in S L , implement an atomic SWMR register R. To do so, we show that the calls of Write() by w and of Read() by readers satisfy Properties 1 and 2 of atomic SWMR registers given in Section 2.1.
Definition 9. The operation write($v$) is the execution of Write($v$) by the writer $w$ for some tuple $v = \langle sn_w, u \rangle$: this operation starts when $w$ calls Write($v$) and it completes if and when this call returns. An operation read($v$) is an execution of Read() that returns $v$ to some process $q$: this operation starts when $q$ calls Read() and it completes when this call returns $v$ to $q$.

Let $v_0 = \langle 0, u_0 \rangle$ be the initial value of the implemented register $R$, and, for $k \ge 1$, let $v_k = \langle k, - \rangle$ denote the $k$-th value written by the writer $w$ on $R$. Note that all the $v_k$'s are distinct, since their sequence numbers differ.
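Algorithm 1 Implementation of an atomic SWMR register writable by process $w$ and readable by all processes in $S_L$, provided that at most $t_L$ processes crash.

Shared variables
    For every $S_i$ in $L$ and every $p \in S_i$: $R_i[p]$, an atomic SWMR register writable by $p$ and readable by all processes in $S_i$.

Write($\langle sn_w, u \rangle$):    executed by the writer $w$
 1:  send $\langle W, \langle sn_w, u \rangle \rangle$ to every process $p$ in $S_L$
 2:  wait for $\langle \text{ACK-W}, sn_w \rangle$ messages from $n - t_L$ distinct processes
 3:  return

executed by every process $p$ in $S_L$
 4:  upon receipt of a $\langle W, \langle sn_w, u \rangle \rangle$ message from process $w$:
 5:      for every $i$ in $\{1, \ldots, m\}$ such that $p \in S_i$ do
 6:          $\langle sn, - \rangle \leftarrow R_i[p]$
 7:          if $sn_w > sn$ then
 8:              $R_i[p] \leftarrow \langle sn_w, u \rangle$
 9:      send $\langle \text{ACK-W}, sn_w \rangle$ to process $w$

Read():    executed by any reader $q$
10:  $sn_r \leftarrow sn_r + 1$
11:  send $\langle R, sn_r \rangle$ to every process $p$ in $S_L$
12:  wait for $\langle \text{ACK-R}, sn_r, \langle -, - \rangle \rangle$ messages from $n - t_L$ distinct processes
13:  $\langle seq, val \rangle \leftarrow \max\{\langle r\_sn, r\_u \rangle \mid$ received a $\langle \text{ACK-R}, sn_r, \langle r\_sn, r\_u \rangle \rangle$ message$\}$
14:  Write($\langle seq, val \rangle$)
15:  return $\langle seq, val \rangle$

executed by every process $p$ in $S_L$
16:  upon receipt of a $\langle R, sn_r \rangle$ message from process $q$:
17:      $\langle r\_sn, r\_u \rangle \leftarrow \max\{\langle sn, u \rangle \mid \exists i \in \{1, \ldots, m\},\ p \in S_i$ and $\exists q \in S_i,\ R_i[q] = \langle sn, u \rangle\}$
18:      send $\langle \text{ACK-R}, sn_r, \langle r\_sn, r\_u \rangle \rangle$ to process $q$

Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of $\Pi = \{p_1, p_2, \ldots, p_n\}$. To prove the correctness of the SWMR implementation shown in Algorithm 1, we now consider an arbitrary execution of this implementation in $S_L$ under the assumption that at most $t_L$ processes crash.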
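The register-update and register-scan steps of Algorithm 1 can be exercised with a small sequential simulation. This is only a sketch under strong simplifying assumptions: there is no real message passing or concurrency, operations run one at a time, and the caller chooses which processes respond (the `responders` argument stands in for the $n - t_L$ acknowledgements). The class name `MMRegister` and its interface are hypothetical.

```python
class MMRegister:
    """Sequential sketch of Algorithm 1 over a bag of process sets."""

    def __init__(self, procs, bag, t, initial=None):
        self.procs = list(procs)
        self.bag = [frozenset(S) for S in bag]
        self.t = t
        # R[(i, p)] models the SWMR register R_i[p]; values are (sn, u) tuples
        self.R = {(i, p): (0, initial)
                  for i, S in enumerate(self.bag) for p in S}

    def _check(self, responders):
        # the caller must name at least n - t distinct responders
        assert len(set(responders)) >= len(self.procs) - self.t

    def write(self, value, responders):
        sn_w, _ = value
        self._check(responders)
        for p in responders:                      # lines 4-9 at each responder
            for i, S in enumerate(self.bag):
                if p in S and sn_w > self.R[(i, p)][0]:
                    self.R[(i, p)] = value

    def read(self, responders, writeback_responders=None):
        self._check(responders)
        replies = []
        for p in responders:                      # lines 16-18 at each responder
            visible = [self.R[(i, q)]
                       for i, S in enumerate(self.bag) if p in S for q in S]
            replies.append(max(visible, key=lambda tup: tup[0]))
        best = max(replies, key=lambda tup: tup[0])   # line 13
        self.write(best, writeback_responders or responders)  # line 14
        return best
```

With the bag {{1, 2}, {2, 3}, {3, 4}} and t = 2, a write acknowledged by processes 1 and 4 is still seen by a read answered only by processes 2 and 3, because the two sets meet at shared registers. With the all-singletons bag and the same (too large) t, the same read returns the initial value.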
Lemma 10. Any read (−) or write(−) operation executed by a process that does not crash completes.
Proof. The only statements that could prevent the completion of a read (−) or write(−) operation are the wait statements of line 2 and line 12. But since communication links are reliable, these wait statements are for n − t L acknowledgements, and at most t L processes out of the n processes of S L may crash, it is clear that these wait statements cannot block.
We first note that every read operation returns some v k for k ≥ 0.
Lemma 11. If r is a read (v) operation in the execution, then v = v k for some k ≥ 0.
Proof. This proof is straightforward and omitted here.
The next lemma says that no read operation can read a "future" value, i.e., a value that is written after the read completes.
Lemma 12. If r is a read (v) operation in the execution, then either v = v 0 , or v = v k such that the operation write(v k ) precedes r or is concurrent with r.
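Note that the guard in lines 6-8 (which is the only place where SWMR registers are updated) ensures that the contents of all the SWMR registers in $S_L$ are non-decreasing in the following sense:

Observation 13. For every $S_i$ in $L$ and every process $p \in S_i$, the sequence numbers of the tuples successively stored in $R_i[p]$ never decrease.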
Lemma 14. For all $k \ge 1$, if a call to the procedure Write($v_k$) returns before a read($v$) operation starts, then $v = v_\ell$ for some $\ell \ge k$.
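Proof. Suppose a call to Write($v_k$) returns before a read($v$) operation starts; we must show that $v = v_\ell$ with $\ell \ge k$. Note that before this call of Write($v_k$) returns, $\langle \text{ACK-W}, k \rangle$ messages are received from a set $P$ of $n - t_L$ distinct processes (see line 2 of the Write() procedure). From lines 5-8, which are executed before these $\langle \text{ACK-W}, k \rangle$ messages are sent, and by Observation 13, it is now clear that the following holds:

Claim 14.1. By the time Write($v_k$) returns, every shared SWMR register in $S_L$ that can be written by a process in $P$ contains a tuple $\langle k', - \rangle$ with $k' \ge k$.

Now consider the read($v$) operation, say it is by process $q$. Recall that read($v$) is an execution of the Read() procedure that returns $v$ to $q$. When $q$ calls Read(), it increments a local counter $sn_r$ and asks every process $p$ in $S_L$ to do the following: (a) read every SWMR register that $p$ can read, and (b) reply to $q$ with a $\langle \text{ACK-R}, sn_r, \langle r\_sn, r\_u \rangle \rangle$ message such that $\langle r\_sn, r\_u \rangle$ is the tuple with the maximum $r\_sn$ that $p$ read. By line 12 of the Read() procedure, $q$ waits to receive such $\langle \text{ACK-R}, sn_r, \langle -, - \rangle \rangle$ messages from a set $P'$ of $n - t_L$ distinct processes, and $q$ uses these messages to select the value $v$ as follows:

$v \leftarrow \max\{\langle r\_sn, r\_u \rangle \mid q$ received a $\langle \text{ACK-R}, sn_r, \langle r\_sn, r\_u \rangle \rangle$ message from a process in $P'\}$

Thus, by Lemma 11, it is clear that:

Claim 14.2. $v = v_\ell$ where $\ell = \max\{j \mid q$ received a $\langle \text{ACK-R}, sn_r, \langle j, - \rangle \rangle$ message from a process in $P'\}$.

Recall that by the definition of $t_L$:
1. $P$ and $P'$ intersect, or
2. some set $S_i$ in $L$ contains both a process in $P$ and a process in $P'$.

So there are two cases:

1. $P$ and $P'$ intersect. Let $p$ be a process in both $P$ and $P'$. Let $S_i$ be a set in $L$ that contains $p$ (this set exists by Assumption 5). Since $p \in P$, by Claim 14.1, by the time the call to Write($v_k$) returns, $R_i[p]$ contains a tuple $\langle k', - \rangle$ with $k' \ge k$. Since $p \in P'$, during the execution of read($v$) by $q$, $p$ reads all the shared SWMR registers that it can read, including $R_i[p]$; by Observation 13, the tuple that $p$ reads from $R_i[p]$ has a sequence number that is at least $k$. Then $p$ selects the tuple $\langle j, - \rangle$ with the maximum $sn$ among all the $\langle sn, - \rangle$ tuples that it read (see line 17); note that $j \ge k$. So the $\langle \text{ACK-R}, sn_r, \langle j, - \rangle \rangle$ message that $p$ sends to $q$, and $q$ uses to select $v$, is such that $j \ge k$. So, by Claim 14.2, $v = v_\ell$ such that $\ell \ge j \ge k$.

2. Some set $S_i$ in $L$ contains both a process $p$ in $P$ and a process $p'$ in $P'$. The register $R_i[p]$ is one of the SWMR registers that can be written by $p$. From Claim 14.1, by the time the call to Write($v_k$) returns, $R_i[p]$ contains a tuple $\langle k', - \rangle$ with $k' \ge k$. Since $p' \in P'$, during the execution of read($v$) by $q$, $p'$ reads all the shared SWMR registers that it can read; note that since $p'$ is in $S_i$, $p'$ reads $R_i[p]$. The rest of the argument is as in case 1: $p'$ sends to $q$ a tuple $\langle j, - \rangle$ with $j \ge k$, and so, by Claim 14.2, $v = v_\ell$ such that $\ell \ge j \ge k$.

We now show that Algorithm 1 satisfies Properties 1 and 2 of atomic SWMR registers.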
Lemma 16. The write(−) and read (−) operations satisfy Property 1.
Proof. Suppose for contradiction that Property 1 does not hold. Thus there is a read operation r = read (v) such that:
(a) there is no write($v$) operation that immediately precedes $r$ or is concurrent with $r$, and
(b) some write(−) operation precedes $r$, or $v \ne v_0$.
There are two cases. Since both cases lead to a contradiction, Property 1 holds.
Lemma 17. The write(−) and read (−) operations satisfy Property 2.
Proof. We have to show that if a read($v_k$) operation precedes a read($v_{k'}$) operation then $k \le k'$. Suppose read($v_k$) precedes read($v_{k'}$). Note that during the read($v_k$) operation, namely in line 14, there is a call to the procedure Write($v_k$) which returns before the read($v_k$) operation completes. So this call to Write($v_k$) returns before the read($v_{k'}$) operation starts. By Lemma 14, $k \le k'$.
Lemmas 10, 16 and 17 immediately imply:
Theorem 18. Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of $\Pi = \{p_1, p_2, \ldots, p_n\}$. If at most $t_L$ processes crash in $S_L$, for every process $w$ in $S_L$, Algorithm 1 implements an atomic SWMR register writable by $w$ and readable by all processes in $S_L$.
Lower bound
Theorem 19. Let S L be the general m&m system induced by a bag L = {S 1 , . . . , S m } of subsets of Π = {p 1 , p 2 , . . . p n }. If more than t L processes crash in S L , for some process w in S L , there is no algorithm that implements an atomic SWMR register writable by w and readable by all processes in S L .
Proof. Let $S_L$ be the general m&m system induced by a bag $L = \{S_1, \ldots, S_m\}$ of subsets of $\Pi = \{p_1, p_2, \ldots, p_n\}$. Suppose for contradiction that: $t > t_L$ processes can crash in $S_L$, but for every process $w$ in $S_L$, there is an algorithm $A_w$ that implements an atomic SWMR register writable by $w$ and readable by all processes in $S_L$ (*). Since $t > t_L$, by the definition of $t_L$ there exist two subsets $P$ and $P'$ of the set of processes $\Pi$, of size $n - t$ each, such that:

(1) $P$ and $P'$ do not intersect, and
(2) no set $S_i$ in $L$ contains both a process in $P$ and a process in $P'$.

By (1), $P$, $P'$, and $Q = \Pi - (P \cup P')$ form a partition of the set of processes $\Pi$. Let the writer $w$ be any process in $P$. Let $A$ be an algorithm that tolerates $t > t_L$ process crashes in $S_L$ and implements an atomic SWMR register $R$ that is writable by $w$ and readable by all processes in $S_L$; this algorithm exists by our initial assumption (*).

Since $|P \cup Q \cup P'| = n$, clearly $|P \cup Q| = |P' \cup Q| = n - (n - t) = t$. Since algorithm $A$ tolerates $t$ crashes, it works correctly in every execution in which all the processes in $P \cup Q$ or in $P' \cup Q$ crash.
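The partition used in this proof can be computed explicitly for small systems. The sketch below searches for disjoint sets $P$ and $P'$ of size $n - t$ that no $S_i$ bridges, and returns them together with $Q$; it returns `None` when no such pair exists, which is the case whenever $t \le t_L$. The function name `find_split` and the input format are assumptions of this illustration.

```python
from itertools import combinations

def find_split(procs, bag, t):
    """Return (P, P_prime, Q) satisfying conditions (1) and (2), or None."""
    size = len(procs) - t
    for P in map(set, combinations(procs, size)):
        for P_prime in map(set, combinations(procs, size)):
            if P & P_prime:
                continue          # violates (1): the sets must be disjoint
            if any(S & P and S & P_prime for S in bag):
                continue          # violates (2): some S_i meets both sets
            return P, P_prime, set(procs) - P - P_prime
    return None
```

For four processes with no shared memory and t = 2, the search finds two disjoint pairs with an empty $Q$, so each of $P \cup Q$ and $P' \cup Q$ has exactly t processes, matching the counting step above.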
We now define three executions E 1 , E 2 , and E 3 of algorithm A. These are illustrated in Figure 3 .
Execution E 1 of algorithm A is defined as follows:
• The processes in $P' \cup Q$ crash from the beginning of the execution; they take no steps in $E_1$.
• At some time $t_w^s$ the writer $w$ starts an operation to write the value $v$ into the implemented register $R$, for some $v \ne v_0$, where $v_0$ is the initial value of $R$. Since the number of processes that crash in $E_1$ is $|P' \cup Q| = t$, and the algorithm $A$ tolerates $t$ crashes, this write operation eventually terminates, say at time $t_w^e$.

[...] for all $i$ such that $1 \le i \le \ell$. We will prove that there are configurations C [...] (ii) the set of messages sent by processes in $P'$ to processes in $P'$, but not yet received, is the same in C 3 [...]. This is shown by induction on $i$.
For the basis of the induction, $i = 0$, we take $C_3^0$ to be the configuration of the algorithm just before time $t_r^s$ in $E_3$. Since no process in $P'$ takes a step before time t [...]

Proof of Claim 19.3. Suppose, for contradiction, that at time $t_r^s$ in $E_3$, some shared register $R$ that can be read by a process $p'$ in $P'$ does not have its initial value. By construction, $E_3$ is identical to $E_1$ until time $t_r^s$, and so only processes in $P$ take steps before time $t_r^s$ in $E_3$. Thus, register $R$ was written by some process $p$ in $P$ by time $t_r^s$ in $E_3$. Since $R$ is readable by $p' \in P'$ and is written by $p \in P$, $R$ is shared by both $p$ and $p'$. Thus, there must be a set $S_i$ in $L$ that contains both $p$ and $p'$, a contradiction to (2).

By Claim 19.3, the shared registers readable by processes in $P'$ have the same value (namely, their initial value) in C [...]

For the induction step, for each $i$ such that $1 \le i \le \ell$, we consider separately the cases of $s_i$ being a step to send a message, receive a message, write a shared register, and read a shared register. In each case, it is easy to verify that, assuming (inductively) that C [...]

To complete the definition of $E_3$, after time $t_r^e$ we let processes take steps in round-robin fashion. Whenever a process's step is to receive a message, it receives the oldest one sent to it; this ensures that all messages are eventually received. Processes continue taking steps in this fashion according to algorithm $A$.
Since E 3 is identical to E 1 up to and including time t e w , E 3 is indistinguishable from E 1 up to and including time t e w to all processes in P . This proves part (a) of the claim.
Note that in E 3 and E 2 , the processes in P : (a) take no steps before time t 19.2(c) ), this contradicts the assumption that A is an implementation of an atomic SWMR register R that tolerates t > t L crashes.
Note that the above proof does not depend on the type of objects shared by the processes in each set S i of the bag L. So it can be easily adapted to prove the following stronger result:
Theorem 20. Consider any m&m system S induced by a bag L = {S 1 , . . . , S m } of subsets of Π = {p 1 , p 2 , . . . p n }, where the processes in S i share arbitrary objects (not only registers). If more than t L processes crash in S, for some process w in S, there is no algorithm that implements an atomic SWMR register writable by w and readable by all processes in S.
SWMR register implementation in uniform m&m systems
Let G = (V, E) be an undirected graph where V = {p 1 , p 2 , . . . p n }, i.e., the nodes of G are the processes p 1 , p 2 , . . . p n . Let S G be the uniform m&m system induced by G.
Recall that in $S_G$, the neighbours of every process $p_i$ in $G$ share some SWMR registers $R_i[*]$ that can be read by (and only by) themselves; intuitively these registers are located at $p_i$. We now use the connectivity of the graph $G$ to determine the maximum number of process crashes that may occur in $S_G$ such that it is possible to implement a shared atomic SWMR register readable by all processes in $S_G$. To do so, we first recall the definition of the square of the graph $G$: $G^2 = (V, E')$ where $E' = \{(u, v) \mid (u, v) \in E$ or $\exists k \in V$ such that $(u, k) \in E$ and $(k, v) \in E\}$.
Definition 21. Given an undirected graph $G = (V, E)$ such that $V = \{p_1, p_2, \ldots, p_n\}$, $t_G$ is the maximum integer $t$ such that the following holds. For all subsets $P$ and $P'$ of $V$ of size $n - t$ each:
1. $P$ and $P'$ intersect, or
2. some edge in $G^2$ connects a node in $P$ to a node in $P'$, i.e., $G^2$ has an edge $(u, v)$ such that $u \in P$ and $v \in P'$.

Note that $t_G \ge \lfloor (n-1)/2 \rfloor$.
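For small graphs, $t_G$ can be computed by first building the edge set of $G^2$ and then checking the definition exhaustively. The following sketch does this; the function names and the edge-list representation are assumptions of this illustration, and the search is exponential in $n$.

```python
from itertools import combinations

def square_edges(nodes, edges):
    """Edges of G^2: pairs adjacent in G or sharing a common neighbour."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    sq = set()
    for k in nodes:
        for u in adj[k]:
            sq.add(frozenset((k, u)))            # original edge (k, u)
        for u, v in combinations(adj[k], 2):
            sq.add(frozenset((u, v)))            # u and v both neighbour k
    return sq

def t_G(nodes, edges):
    """Largest t in 0..n for which the condition of the definition holds."""
    sq, n = square_edges(nodes, edges), len(nodes)

    def holds(t):
        for P in combinations(nodes, n - t):
            for P_prime in combinations(nodes, n - t):
                if set(P) & set(P_prime):
                    continue
                if any(frozenset((u, v)) in sq for u in P for v in P_prime):
                    continue
                return False
        return True

    return max(t for t in range(n + 1) if holds(t))
```

On a four-node path the sketch gives 2, and on a five-node star it gives 4, while a graph with no edges gives the pure message-passing bound.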
We now prove that it is possible to implement an atomic SWMR register readable by all processes in the uniform m&m system S G if and only if at most t G processes may crash in S G . More precisely:
Theorem 22. Let S G be the uniform m&m system induced by an undirected graph G = (V, E) where V = {p 1 , p 2 , . . . p n }.
• If at most $t_G$ processes crash in $S_G$, for every process $w$ in $S_G$, it is possible to implement an atomic SWMR register writable by $w$ and readable by all processes in $S_G$.
• If more than $t_G$ processes crash in $S_G$, for some process $w$ in $S_G$, it is impossible to implement an atomic SWMR register writable by $w$ and readable by all processes in $S_G$.
Proof. By Definition 6, $S_G$ is the m&m system $S_L$ where $L = \{S_1, S_2, \ldots, S_n\}$ such that $S_i = N^+(p_i)$, i.e., for all $i$, $1 \le i \le n$, $S_i$ is the set of neighbours of $p_i$ (including $p_i$) in the graph $G$. Recall that $t_L$ is the maximum $t$ such that for all subsets $P$ and $P'$ of $V$ of size $n - t$ each:
