The weakest failure detector to implement a register in asynchronous systems with hybrid communication by Imbs, Damien & Raynal, Michel
To cite this version:
Damien Imbs, Michel Raynal. The weakest failure detector to implement a register in asynchronous systems with hybrid communication. [Research Report] PI-1972, 2011, pp.11. <inria-00583155>
HAL Id: inria-00583155
https://hal.inria.fr/inria-00583155
Submitted on 5 Apr 2011
Publications Internes de l’IRISA
ISSN : 2102-6327
PI 1972 – April 2011
The weakest failure detector to implement a register in
asynchronous systems with hybrid communication
Damien Imbs* , Michel Raynal**
Abstract: This paper introduces an asynchronous crash-prone hybrid system model. The system is hybrid in the way the processes can communicate. On the one hand, a process can send messages to any other process. On the other hand, the processes are partitioned into clusters and each cluster has its own read/write shared memory. In addition to the model, a main contribution of the paper concerns the implementation of an atomic register in this system model. More precisely, a new failure detector (denoted MΣ) is introduced and it is shown that, when considering the information on failures needed to implement a register, this failure detector is the weakest. To that end, the paper presents an MΣ-based algorithm that builds a register in the considered hybrid system model and shows that it is possible to extract MΣ from any failure detector-based algorithm that implements a register in this model. The paper also (a) shows that MΣ is strictly weaker than Σ (which is the weakest failure detector to implement a register in a classical message-passing system) and (b) presents a necessary and sufficient condition to implement MΣ in a hybrid communication system.
Key-words: Asynchronous message-passing system, Atomic register, Distributed algorithm, Failure detector, Fault-tolerance, Hybrid
communication, Necessity proof, Process crash, Shared memory system, Weakest failure detector.
The weakest failure detector to build a register in a system with hybrid communication
Summary: This report introduces a failure detector for distributed systems with hybrid communication (shared memory and message passing) and shows that it is the weakest possible failure detector to build a register in such a system.
Keywords: Asynchronous distributed system, shared memory, message passing, failure detector, optimality.
* Member of the ASAP project: joint team with INRIA, CNRS, Université Rennes 1 and INSA de Rennes, damien.imbs@irisa.fr.
** Senior member of the IUF and ASAP project: joint team with INRIA, CNRS, Université Rennes 1 and INSA de Rennes, raynal@irisa.fr.
© IRISA – Campus de Beaulieu – 35042 Rennes Cedex – France – +33 2 99 84 71 00 – www.irisa.fr
1 Introduction
1.1 Atomic register
Among the objects that allow concurrent processes to exchange information and cooperate toward a common goal, the atomic register is certainly the most fundamental. Such an object (let us denote it REG) provides the processes with two operations, REG.read() and REG.write(v). The read operation provides the invoking process with the value of the object, while the write operation associates a new value v with the object.
Atomicity [10, 11] means that the (possibly concurrent) read and write operations issued on a register appear as if they have been
executed sequentially, and this “witness sequence” is (1) legal (a read returns the value written by the closest write that precedes it in
this sequence) and (2) respects the real time occurrence order on the operations (if the operation op1 terminates before an operation op2
starts, op1 appears before op2 in the witness sequence). Let us observe that concurrent operations can be ordered in any way as long as
the legality property stated in item (1) is satisfied.
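As an illustration of the two conditions above, the following minimal Python sketch checks whether a proposed witness sequence is legal and respects real-time order. The `Op` record, its field names and the helper names are assumptions introduced for this example; they do not come from the report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Op:
    kind: str      # "read" or "write"
    value: object  # value written, or value returned by the read
    start: float   # real-time instant at which the operation is invoked
    end: float     # real-time instant at which the operation returns


def is_legal(witness: Sequence[Op], initial: object = None) -> bool:
    """Condition (1): each read returns the value of the closest preceding write."""
    current = initial
    for op in witness:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def respects_real_time(witness: Sequence[Op]) -> bool:
    """Condition (2): if op1 terminates before op2 starts, op1 precedes op2."""
    for pos, later in enumerate(witness):
        for earlier in witness[:pos]:
            # 'later' is placed after 'earlier' in the witness; this is wrong
            # only if 'later' had already terminated when 'earlier' started.
            if later.end < earlier.start:
                return False
    return True


def is_atomic_witness(witness: Sequence[Op], initial: object = None) -> bool:
    return is_legal(witness, initial) and respects_real_time(witness)
```

Concurrent operations (overlapping intervals) are never constrained by `respects_real_time`, which matches the observation that they can be ordered in any way as long as legality holds.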
1.2 Building an atomic register in a message-passing system
Simulating a register in an asynchronous system In an asynchronous message-passing system, the processes communicate by sending and receiving messages through channels, and no assumption is made on either the speed of processes or message transmission delays.
If the system is reliable, it is easy to build an atomic register on top of an asynchronous message-passing system. This is no longer
the case if processes can crash. Let n be the number of processes that compose the system and t be a model parameter that defines an
upper bound on the number of processes that may crash. Algorithms that build an atomic register object despite asynchrony and up to
t < n/2 process crashes are described in [1, 2].
An important result is proved in [2], namely, there is no algorithm implementing an atomic register in asynchronous message-passing systems where t ≥ n/2. The intuition that underlies this impossibility is that, due to asynchrony and the fact that t ≥ n/2, the system can appear as being partitioned, in such a way that each partition considers that the processes in the other partition have crashed (while they actually have not). The reader interested in a pedagogical introduction to these issues can consult [3, 12, 13].
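The partitioning intuition can be checked by brute force on small systems. The sketch below (an illustration under the assumption that a process waits for answers from n − t processes; the function name is hypothetical) tests whether two disjoint groups of n − t processes can coexist, which is exactly what makes each group able to progress while believing the other one has crashed.

```python
from itertools import combinations


def disjoint_live_groups_exist(n: int, t: int) -> bool:
    """True if two disjoint groups of n - t processes can coexist among n."""
    size = n - t
    procs = range(1, n + 1)
    return any(
        set(a).isdisjoint(b)
        for a in combinations(procs, size)
        for b in combinations(procs, size)
    )
```

For t < n/2 any two groups of n − t processes share a member, while for t ≥ n/2 two disjoint groups exist.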
The failure detector approach to circumvent the “t ≥ n/2” impossibility The failure detector approach [5, 6] has been introduced
to circumvent impossibility results. It consists in enriching each process of an unreliable asynchronous system with an additional device
(sometimes called “oracle”) that provides it with hints on process failures. According to the type and the quality of these hints, several
classes of failure detectors can be defined.
The class of quorum failure detectors, denoted Σ, has been introduced by Delporte-Gallet, Fauconnier and Guerraoui in [7]. (A quorum is a set of processes. Quorums were first introduced by Gifford [8].) It is shown in [4, 7] that Σ is the weakest class of failure detectors that allow building an atomic register object in asynchronous message-passing systems despite any number of process crashes (i.e., in systems where t = n − 1). “Weakest” means that Σ captures the minimal information on failures that has to be known by the processes in order to implement a register. The definition of Σ is given in Section 3.2. It is important to notice that it follows from the results of [2] and [7] that Σ cannot be implemented in asynchronous message-passing systems in which any number of processes may crash.
1.3 Content of the paper
Towards new system models The advent of multicore architectures where processors share a common memory and the design of clusters (where, for example, each cluster is a multicore system) communicating by message-passing open the door to the design of new computing models where processes communicate both by shared memory (intra-cluster communication) and message passing (point-to-point communication).
Context and content of the paper This paper is on the construction of atomic registers in hybrid models (such as the one previously
described). It has several contributions.
• It first introduces a simple asynchronous crash-prone model, denoted SM_MPn,m[∅], that captures the previous intra-cluster and point-to-point communication types (the meaning of m will be defined later).
• The paper then introduces a new failure detector, denoted MΣ, and
– Presents and proves correct an algorithm that builds an atomic register in SM_MPn,m[MΣ] (SM_MPn,m[∅] enriched with MΣ),
– Shows that MΣ is the weakest information on failures SM_MPn,m[∅] has to be enriched with so that an atomic register can be implemented.
• The paper finally shows that MΣ is strictly weaker than Σ. It also presents a necessary and sufficient condition to implement MΣ in a hybrid communication system.
Roadmap The paper is made up of 7 sections. Section 2 presents the computation model SM_MPn,m[∅]. The new failure detector
class MΣ is introduced in Section 3. Then, Section 4 presents an algorithm that builds an atomic register in SM_MPn,m[MΣ]
and Section 5 shows that MΣ is optimal. Section 6 presents a necessary and sufficient condition to implement MΣ in a hybrid
communication system. Finally, Section 7 concludes the paper.
2 A hybrid communication system model
2.1 System model with hybrid communication
Process model The system comprises n processes denoted p1, . . . , pn. Each process pi is asynchronous (i.e., it proceeds at an arbitrary speed) and sequential (it executes one step, i.e., one base action, at a time). Π = {1, . . . , n} is the set of process identities.
A process can crash. A crash is a premature halt (after it has crashed, if it ever does, a process executes no more steps). Let t be the upper bound on the number of processes that are allowed to crash. We assume here t = n − 1 (this is sometimes called the wait-free model).
Progress condition In the following we are interested in a system model whose algorithms satisfy the wait-freedom progress condition
[9]. When considering an algorithm implementing an atomic register REG , this means that a process that does not crash must return
from all its invocations of the operations REG .read() and REG .write().
Message-passing communication Processes can send and receive messages through reliable channels. It is assumed that any pair of
processes is connected by a bidirectional channel. Channels are reliable but asynchronous. Reliable means that messages are neither
corrupted, nor duplicated nor lost. Asynchronous means that, albeit finite, message transfer delays are arbitrary.
The sending and the reception of a message are atomic steps. The processes can also use a broadcast operation, but this operation is
not atomic (if a process crashes during a broadcast, an arbitrary subset of the processes receive the corresponding message).
Partially shared memory communication The n processes are partitioned into m, 1 ≤ m ≤ n, non-empty subsets P[1], . . . , P[m] called clusters (i.e., ∪_{1≤x≤m} P[x] = Π and ∀ x, y : (x ≠ y) ⇒ (P[x] ∩ P[y] = ∅)).
Inside each cluster x, 1 ≤ x ≤ m, the processes in P[x] share a common read/write memory denoted MEMx. MEMx is composed of a set of 1WMR (single-writer/multi-reader) atomic registers (this assumption is without loss of generality as multi-writer/multi-reader atomic registers can be built on top of single-writer/multi-reader atomic registers [3, 11, 12]). For notational convenience, we use an index/array notation for every register of MEMx: if i ∈ P[x], MEMx[i] can be written only by pi and read by all processes in P[x] (if i ∉ P[x], MEMx[i] is meaningless and pi cannot access MEMx).
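A minimal Python sketch of this structure is given below, assuming processes are identified by the integers 1..n and clusters are given as Python sets. The names `is_valid_clustering`, `cluster_of` and `ClusterMemory` are hypothetical; the access checks simply mirror the rule that MEMx[i] is written only by pi and read only by members of P[x].

```python
def is_valid_clustering(n, clusters):
    """Check that clusters P[1..m] are non-empty, disjoint, and cover {1..n}."""
    members = [p for cluster in clusters for p in cluster]
    return (
        1 <= len(clusters) <= n
        and all(len(cluster) > 0 for cluster in clusters)
        and len(members) == n                      # no process appears twice
        and set(members) == set(range(1, n + 1))   # every process appears
    )


def cluster_of(i, clusters):
    """Return the index x (1-based) such that i is in P[x]."""
    for x, cluster in enumerate(clusters, start=1):
        if i in cluster:
            return x
    raise ValueError(f"process {i} belongs to no cluster")


class ClusterMemory:
    """MEM_x: one single-writer/multi-reader register per process of P[x]."""

    def __init__(self, members):
        self._regs = {i: None for i in members}

    def write(self, writer, value):
        # p_writer can only write its own entry MEM_x[writer].
        if writer not in self._regs:
            raise PermissionError("process is outside this cluster")
        self._regs[writer] = value

    def read(self, reader, owner):
        # Only the processes of P[x] can read the registers of MEM_x.
        if reader not in self._regs or owner not in self._regs:
            raise PermissionError("process is outside this cluster")
        return self._regs[owner]
```

The partition used in the test values below is an arbitrary example with n = 7 and m = 3, not a transcription of Figure 1.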
Two examples of partially shared memory are depicted in Figure 1 where the communication channels are not depicted. In both cases, we have n = 7 and m = 3 but the partitions are different.
Figure 1: Two examples of partially shared memories (in each example, the processes p1, . . . , p7 are grouped into three clusters P[1], P[2], P[3] with shared memories MEM1, MEM2, MEM3)
Notation As already indicated in the introduction, SM_MPn,m[∅] is used to denote the previous base wait-free hybrid distributed computing model. In the following ∅ will be replaced by a failure detector to denote the corresponding enriched model. In Figure 1 we have two instances of SM_MP7,3[∅].
Two particular cases The two extreme cases m = 1 and m = n are particularly interesting. The case m = 1 corresponds to the case where all processes share a common read/write memory. In that case, as the read/write communication model is stronger than the message-passing model, message-passing communication becomes useless and, consequently, SM_MPn,1[∅] is the classical shared memory model.
When m = n, there is a single process in each partition and for each x, 1 ≤ x ≤ n, MEMx boils down to the local memory of a single process. Hence, SM_MPn,n[∅] is the classical send/receive message-passing model.
2.2 An atomic register cannot be built in SM_MPn,m[∅] when m > 1
Theorem 1 Let 1 < m ≤ n. It is impossible to build an atomic register in SM_MPn,m[∅].
Proof The proof is a simple reduction to the impossibility theorem stating that there is no wait-free implementation of a register in an asynchronous send/receive message-passing system [2, 3, 12, 13].
To that end, let us assume that there is an algorithm A that builds a register in SM_MPn,m[∅] and consider its executions in SM_MPn,m[∅] where, in each partition, all processes but one crash before taking any step. As A is wait-free, it follows that the m > 1 remaining processes implement an atomic register in the system model SM_MPm,m[∅], i.e., in a pure message-passing system model. This contradicts the impossibility theorem recalled above; hence algorithm A cannot exist, which concludes the proof. ✷Theorem 1
3 A new failure detector class
3.1 Failure pattern and failure detector
The underlying time model is the set N of natural integers. This time notion is not accessible to the processes. It can only be used from
an external observer point of view to state or prove properties. Time instants are denoted τ , τ ′, etc.
Formal definitions The notions introduced here are from [6].
A failure pattern is a function F () such that F (τ) denotes the set of processes that have crashed by time τ . As crashes are stable,
we have ∀ τ : F (τ) ⊆ F (τ + 1). Given a run, let F be the set of processes that crash in that run (these are the faulty processes) and C
the set of processes that do not crash (these are the correct processes). We have F = ∪τF (τ) and C = Π \ F .
A failure detector history H with range R is a function from Π × N to R whose meaning can be interpreted as follows: H(i, τ) is the output of the considered failure detector at process pi at time τ.
A failure detector D with range R is a function that maps each failure pattern F() to a non-empty set of failure detector histories with range R. D(F) is the set of behaviors (possible failure detector histories) that D can exhibit when the failure pattern is F.
On the operational side From an algorithm point of view, a failure detector can be seen as a distributed device that provides each
process pi with a read-only local variable whose value at time τ is H(i, τ).
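These definitions can be made concrete with a short Python sketch. It assumes a run is summarized by a dictionary mapping each faulty process to its crash instant; this encoding and the function names are choices made for the example only.

```python
def make_failure_pattern(crash_time):
    """Build F() from a dict {process: crash instant}; absent means correct."""
    def F(tau):
        # F(tau): set of processes that have crashed by time tau.
        return {p for p, when in crash_time.items() if when <= tau}
    return F


def faulty_and_correct(crash_time, all_procs):
    """Return (F, C): the faulty processes and the correct ones, C = Π \\ F."""
    faulty = set(crash_time)
    return faulty, set(all_procs) - faulty
```

Because a crash instant is never removed from the dictionary, `F(tau)` can only grow with `tau`, which is the stability condition F(τ) ⊆ F(τ + 1).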
3.2 The failure detector class Σ
As already indicated, this failure detector class [7] is the class of the weakest failure detectors that allow an atomic register to be implemented in the base send/receive message-passing system model. Using the formalism introduced in Section 2, this means that SM_MPn,n[Σ] is the weakest failure detector-based system model in which an atomic register can be built.
The range of Σ is the set of all non-empty subsets of processes (2^Π \ {∅}). Let Σi be the read-only local variable provided to pi by Σ. Such a local output is called a quorum. This failure detector class is defined by the two following properties, where Σ_i^τ denotes the value of Σi at time τ, i.e., Σ_i^τ = H(i, τ).
• Intersection. ∀ i, j ∈ Π, ∀ τ, τ′ : Σ_i^τ ∩ Σ_j^{τ′} ≠ ∅.
• Liveness. ∃ τ : ∀ τ′ ≥ τ : ∀ i ∈ C : Σ_i^{τ′} ⊆ C.
The intersection property states that any two quorums taken at any times intersect. This property prevents partitioning and is used to maintain the consistency of the atomic register. The liveness property states that eventually a quorum contains only correct processes. This property is used to allow a process to stop waiting for messages from crashed processes. Because any two majorities always intersect, Σ can easily be implemented in a message-passing system in which a majority of processes never crash.
3.3 The failure detector class MΣ
Definition This failure detector class is for the system model SM_MPn,m[∅]. It consists of all the failure detectors that satisfy the
following properties, where the quorum MΣi is the local output at process pi and MΣ_i^τ its value at time τ.
• Intersection. ∀ i, j ∈ Π, ∀ τ, τ′ : ∃ x, k, ℓ : (x ∈ [1..m]) ∧ (k ∈ MΣ_i^τ) ∧ (ℓ ∈ MΣ_j^{τ′}) ∧ (k, ℓ ∈ P[x]).
• Liveness. ∃ τ : ∀ τ′ ≥ τ : ∀ i ∈ C : MΣ_i^{τ′} ⊆ C.
The liveness property is the same as the one of Σ. The intersection property is more general. It states that any pair of quorums (whose values are taken at any times) is such that each one contains a process and these two processes share the same common memory. This can be seen as an “indirect” intersection: MΣi and MΣj are not required to intersect “directly” but must include processes that share the same memory.
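The indirect intersection can be written as a small predicate. The following Python sketch assumes clusters and quorums are given as sets of process identities; the function names are hypothetical and the check ranges over a finite list of observed quorum values rather than over a whole (infinite) history.

```python
from itertools import product


def indirectly_intersect(q1, q2, clusters):
    """True if some cluster P[x] contains a member of q1 and a member of q2."""
    return any(bool(q1 & cluster) and bool(q2 & cluster) for cluster in clusters)


def satisfies_intersection(observed_quorums, clusters):
    """Check the intersection property of MΣ on every pair of observed quorums."""
    quorums = [frozenset(q) for q in observed_quorums]
    return all(
        indirectly_intersect(a, b, clusters)
        for a, b in product(quorums, repeat=2)
    )
```

With clusters {1, 2} and {3, 4}, the quorums {1} and {2} are disjoint yet acceptable, because p1 and p2 share the same memory; the quorums {1} and {3} are not acceptable.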
Particular cases Let us first consider the case m = 1 (the model is then the classical base read/write shared memory model). In that case, there is a single shared memory (MEM1) and, taking always MΣi = {i} for each pi, both properties are always satisfied. Hence, there is a trivial implementation of MΣ in SM_MPn,1[∅], which means that MΣ adds no computational power when m = 1. This is in perfect agreement with the fact that SM_MPn,1[∅] is the base read/write shared memory model in which atomic registers are given for free.
Let us now consider the case m = n (the model is then the classical send/receive message-passing model). In that case, there is a single process per cluster x (e.g., P[x] contains only px). It follows that, for the intersection property to be true, we need to have ∀ i, j ∈ Π, ∀ τ, τ′ : ∃ k : (k ∈ MΣ_i^τ) ∧ (k ∈ MΣ_j^{τ′}), i.e., ∀ i, j, ∀ τ, τ′ : MΣ_i^τ ∩ MΣ_j^{τ′} ≠ ∅. Hence, when considering m = n, MΣ boils down to Σ, which means that SM_MPn,n[MΣ] and SM_MPn,n[Σ] define the same computational model.
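Both particular cases can be verified exhaustively on a small system. The sketch below uses n = 4 as an arbitrary choice and a hypothetical predicate `indirectly_intersect` that encodes the intersection property of MΣ for one pair of quorums.

```python
from itertools import combinations


def indirectly_intersect(q1, q2, clusters):
    """True if some cluster contains a member of q1 and a member of q2."""
    return any(bool(q1 & cluster) and bool(q2 & cluster) for cluster in clusters)


def nonempty_subsets(procs):
    """All non-empty subsets of procs, as frozensets."""
    ordered = sorted(procs)
    return [
        frozenset(subset)
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


n = 4
procs = frozenset(range(1, n + 1))

# Case m = 1: a single cluster, and the trivial outputs MΣ_i = {i}.
single_cluster = [procs]
trivial_outputs = [frozenset({i}) for i in procs]
m1_intersection_holds = all(
    indirectly_intersect(a, b, single_cluster)
    for a in trivial_outputs
    for b in trivial_outputs
)

# Case m = n: one process per cluster, so indirect and direct
# intersection coincide on every pair of quorums.
singleton_clusters = [frozenset({i}) for i in procs]
mn_same_as_sigma = all(
    indirectly_intersect(a, b, singleton_clusters) == bool(a & b)
    for a in nonempty_subsets(procs)
    for b in nonempty_subsets(procs)
)
```

Both flags evaluate to `True`: singleton quorums suffice when m = 1, and the MΣ intersection property is the Σ intersection property when m = n.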
4 MΣ is sufficient: building an atomic register in SM_MPn,m[MΣ]
This section presents and proves correct an algorithm that builds a 1WMR atomic register in SM_MPn,m[MΣ]. The writer is denoted pw. The atomic register that is constructed is denoted REG.
4.1 Description of the algorithm
The algorithm, described in Figure 2, is a simple adaptation to the hybrid model of the algorithm described in [2] that builds an atomic
register in a message-passing system where a majority of processes are correct. As already indicated, while the operation send is atomic,
the operation broadcast is not.
This algorithm is not designed with efficiency in mind. Its aim is only to show that an atomic register can be built in SM_MPn,m[MΣ], and consequently show that MΣ is sufficient. (Let us remember that, when m = 1, the underlying message-passing system can be easily simulated on top of the shared memory.)
The variables implementing the atomic register REG Let pi be a process and x its cluster (i.e., i ∈ P[x]). Process pi stores its “local copy” of REG in MEMx[i]. More precisely, this base register has two fields: MEMx[i].val (which stores the last value of REG known by pi) and MEMx[i].sn (which stores the corresponding sequence number).
The variables in italics with subscript s are variables that are local to process ps. These local variables are used to generate local
sequence numbers.
The operation REG.write(v) This operation (which can be issued only by pw) first associates a new sequence number (snw) with its current invocation (line 01). Then, it broadcasts the message WRITE(v, snw) to inform all the processes of the new write (line 02). When MΣw is such that pw has received a matching acknowledgment from each of the processes it contains, the operation returns ok and terminates (lines 03-04).
Let pi be a process such that i ∈ P[x]. When pi (pi can be pw) receives a message WRITE(v, seqnb) from a process pj, it updates MEMx[i] if this message carries a more recent write (line 12). Moreover, pi always sends back an acknowledgment carrying seqnb (line 13) to inform pj that its “local copy” of REG now has a sequence number that is ≥ seqnb.
The operation REG.read() This operation proceeds in two phases. In the first phase (lines 05-08), pi broadcasts a message READ(r_sni), where r_sni is used to identify each of its read invocations (lines 05-06), and waits until MΣi contains only processes from which pi has received a matching answer (line 07).
When a process pk receives such a message READ(r_sn) from a process pj, it computes the most recent value of REG stored in the shared memory MEMx of its cluster, i.e., the cluster x such that k ∈ P[x] (lines 14-15), and sends this most recent value back to pj (line 16).
Finally, pi determines the most recent value of REG it has received from the processes in MΣi (line 08). That value will be returned by the read operation (line 11) but, before returning it, pi has to execute the second phase (lines 09-10), whose aim is to ensure that no overwritten value is ever returned by a read operation. To that end, pi simulates a write of the value it is about to return.
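The role of the cluster memories can be illustrated with a deliberately simplified, sequential Python model of the algorithm. This is a sketch under strong assumptions, not the algorithm of Figure 2 itself: there is no concurrency and no crash, the MΣ outputs are fixed and supplied by the caller (the dictionary `quorums`), acknowledgments are implicit, and each WRITE or READ message is delivered only to the members of the sender's quorum, which models a schedule where all other messages are arbitrarily delayed. The class and method names are hypothetical.

```python
class HybridRegister:
    """Sequential toy model of the register construction (illustration only)."""

    def __init__(self, clusters, quorums, writer):
        self.clusters = [set(c) for c in clusters]
        # cluster_of[i] = index x of the cluster P[x] that contains p_i
        self.cluster_of = {i: x for x, c in enumerate(self.clusters) for i in c}
        # mem[x][i] = (val, sn): the register MEM_x[i], written only by p_i
        self.mem = [{i: (None, 0) for i in c} for c in self.clusters]
        self.quorums = quorums  # quorums[i]: assumed current output of MΣ_i
        self.writer = writer
        self.sn_w = 0           # sequence number of the last write

    def _on_write(self, i, v, seqnb):
        """Task T1 at p_i: keep the more recent value (the ACK is implicit)."""
        x = self.cluster_of[i]
        if self.mem[x][i][1] < seqnb:
            self.mem[x][i] = (v, seqnb)

    def _on_read(self, k):
        """Task T2 at p_k: most recent pair stored in p_k's cluster memory."""
        x = self.cluster_of[k]
        return max(self.mem[x].values(), key=lambda pair: pair[1])

    def _write_phase(self, i, v, sn):
        """WRITE(v, sn) reaches exactly the members of p_i's quorum."""
        for j in self.quorums[i]:
            self._on_write(j, v, sn)

    def write(self, v):
        """REG.write(v), issued by the single writer."""
        self.sn_w += 1
        self._write_phase(self.writer, v, self.sn_w)
        return "ok"

    def read(self, i):
        """REG.read() issued by p_i: query the quorum, write back, return."""
        replies = [self._on_read(k) for k in self.quorums[i]]
        v, sn = max(replies, key=lambda pair: pair[1])
        self._write_phase(i, v, sn)  # second phase of the read
        return v
```

With clusters {1, 2} and {3, 4}, writer p1 and quorums MΣ1 = {1}, MΣ2 = {2}, a write by p1 touches only MEM[1], and a later read by p2 queries only p2; the read still returns the written value because p2 scans the whole memory of its cluster, which contains p1's register.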
4.2 Proof of the algorithm
Theorem 2 Let 1 ≤ m ≤ n. The algorithm described in Figure 2 is a wait-free construction of a 1WMR atomic register in SM_MPn,m[MΣ].
Proof We have to show that (a) any invocation of an operation by a correct process terminates whatever the number of process crashes
(wait-freedom) and (b) all operation invocations (except possibly, for each process, its last operation if it is faulty) can be totally ordered
in such a way that (when considering this total order) no operation returns an overwritten value and if the operation invocation inv1
terminates before the operation invocation inv2 starts then inv1 appears before inv2 in the total order.
operation REG.write(v): % This code is only for the single writer pw %
(01) snw ← snw + 1;
(02) broadcast WRITE(v, snw);
(03) wait until (MΣw is such that ∀ j ∈ MΣw: pw has received ACK(snw) from pj);
(04) return(ok).
------------------------------------------------------------
% The code snippets that follow are for every process pi, 1 ≤ i ≤ n %
% Moreover, the value x denotes pi’s partition number, i.e., x is such that i ∈ P[x] %
operation REG.read():
(05) r_sni ← r_sni + 1;
(06) broadcast READ(r_sni);
(07) wait until (MΣi is such that ∀ j ∈ MΣi: pi has received VAL(v, sn, r_sni) from pj);
(08) 〈v, sn〉 ← (〈v, sn〉 | VAL(v, sn, r_sni) received ∧ ∄ sn′ > sn : VAL(−, sn′, −) received);
(09) broadcast WRITE(v, sn);
(10) wait until (MΣi is such that ∀ j ∈ MΣi: pi has received ACK(sn) from pj);
(11) return(v).
Task T1: when WRITE(v, seqnb) is received from pj:
(12) if (MEMx[i].sn < seqnb) then MEMx[i] ← 〈v, seqnb〉 end if;
(13) send ACK(seqnb) to pj.
Task T2: when READ(r_sn) is received from pj:
(14) mem ← {MEMx[k] such that k ∈ P[x]};
(15) 〈v, sn〉 ← (mem[k] | ∄ ℓ : mem[ℓ].sn > mem[k].sn);
(16) send VAL(v, sn, r_sn) to pj.
Figure 2: Building an atomic 1WMR register in SM_MPn,m[MΣ]
Proof of Item (a). Let pi be a correct process that invokes an operation on REG. Let us observe that the only location where pi could block forever is when it executes a wait statement where it is waiting for messages from the processes in MΣi. The proof follows from the following observations: (1) each wait statement follows a broadcast identified by a sequence number, (2) each broadcast message is answered by all correct processes, (3) channels are reliable, and (4) eventually MΣi contains only correct processes.
Proof of Item (b). Let us observe that, as there is a single writer, and it is sequential, the invocations of REG .write() are totally
ordered and this order is in agreement with their sequence numbers. Hence, let us initialize S to the sequence of all write invocations
ordered according to their sequence numbers (that sequence respects their real-time occurrence order).
Considering an invocation of REG.read() issued by a process pi, let sn be the sequence number of the value returned by that invocation. Each invocation of REG.read() is added to S after the write whose sequence number is sn and before (if it exists) the one with sequence number sn + 1. Moreover, if two read invocations obtain the same sequence number, the one that started first is placed in S before the other one. It follows from its definition that S is a correct register history (i.e., no read obtains an overwritten value in S).
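The construction of S can be sketched in a few lines of Python. The encoding is an assumption made for the example: writes are identified by their sequence numbers 1..k, each read is a triple (identifier, sequence number obtained, start time), and sequence number 0 stands for the initial value of REG.

```python
def build_witness(num_writes, reads):
    """Build S: writes in sequence-number order, each followed by its reads.

    reads is a list of (read_id, sn_obtained, start_time) triples.
    """
    witness = []
    for sn in range(0, num_writes + 1):
        if sn > 0:
            witness.append(("write", sn))
        # Reads that obtained sn go after write number sn and before
        # write number sn + 1; ties are broken by start time.
        same_sn = sorted((r for r in reads if r[1] == sn), key=lambda r: r[2])
        witness.extend(("read", r[0]) for r in same_sn)
    return witness
```

For two writes and the reads ("r1", 1, 5.0), ("r2", 0, 1.0), ("r3", 1, 3.0), the function places r2 before the first write, then r3 and r1 (in start order) between the two writes.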
It remains to show that S respects the real-time occurrence order on the operation invocations. As we have seen, this is already the
case for the invocations of the write operation. For the invocations of REG .read() there are two cases to consider.
• The read invocation is issued (in real-time) after the αth write has terminated. To prove that it appears in S after this write we show that the sequence number obtained by this read invocation is ≥ α.
Let pi be the process that issued this read operation and MΣ_i^τ be the quorum value that allowed it to stop waiting at line 07 of its read invocation. Let MΣ_w^{τ′} be the quorum value that allowed pw to stop waiting at line 03 of its αth write invocation. As pi started reading after pw has terminated the αth write, we have τ > τ′. Moreover, due to the intersection property of MΣ, ∃ x, k, ℓ : (k ∈ MΣ_i^τ) ∧ (ℓ ∈ MΣ_w^{τ′}) ∧ (k, ℓ ∈ P[x]). It then follows from lines 12-13 executed by pℓ just before it sent back ACK(α) to pw that we have MEMx[ℓ].sn ≥ α.
Moreover, as τ > τ′ and k ∈ MΣ_i^τ, it follows that when pk (triggered by the message READ(r_sn) from pi) executes lines 14-15, we have MEMx[ℓ].sn ≥ α. Hence, pk sends to pi the message VAL(−, β, r_sn) where β ≥ α (line 16). Consequently, the sequence number computed by pi at line 08 of the read operation is ≥ α, which proves the case.
• The second case is when there are two non-concurrent read invocations such that the first one obtains the sequence number sn1 and the second one obtains the sequence number sn2 (these sequential read invocations can be from different processes and concurrent with write invocations). We need to show that sn2 ≥ sn1.
Let pj be the process that issued the first read invocation. The reasoning is the same as in the previous item after having replaced in that item the write invocation issued by pw by the lines 09-10 executed by pj during its read invocation.
✷Theorem 2
5 MΣ is necessary
5.1 MΣ is the weakest failure detector for a register in a hybrid communication model
The previous section has shown that an atomic register can be built in SM_MPn,m[MΣ], thereby showing that enriching SM_MPn,m[∅] with MΣ is sufficient (from an “information on failures” point of view) when one wants to build an atomic register. This section addresses the necessity side. It shows that any failure detector D such that an atomic register can be built in SM_MPn,m[D] provides enough information on failures for MΣ to be built in SM_MPn,m[D].
Let D be any failure detector such that there is an algorithm A that allows building an atomic register in SM_MPn,m[D]. The proof of the “necessity” part consists in showing that it is possible to build a failure detector of the class MΣ from A executed in SM_MPn,m[D]. In the failure detector parlance, we say that it is possible to “extract” MΣ from A. Interestingly, the proposed extraction algorithm is the one we have presented in [4, 13] (for Σ), but its proof is different. Hence, the current paper shows that the extraction algorithm introduced in [4] has a generic dimension with respect to failure detectors.
5.2 Bonnet-Raynal’s extraction algorithm
This section presents Bonnet-Raynal’s extraction algorithm introduced in [4] where it is assumed that the underlying D-based algorithm A builds an atomic register in SM_MPn,m[D]. Albeit not new, this presentation is needed for completeness of the minimality proof.
Arrays of atomic registers Let Q be a non-empty set of processes, and REGQ[1..n] an array of n atomic registers (initialized to
[⊥, . . . ,⊥]), such that each atomic register REGQ[x] is implemented by the n-process algorithm A executed only by |Q| threads, each
one associated with a process of Q.
A simple register-based algorithm Let WRQ be the following register-based algorithm (also called a task) where each process pi
such that i ∈ Q executes the following algorithm (where regi[1..n] is an array local to pi):
algorithm WRQ:
REGQ[i].write(⊤); for each x ∈ {1, ..., n} do regi[x]← REGQ[x].read() end for.
The process pi first writes the value ⊤ in its entry of the array REGQ, and then reads asynchronously all its entries. The
REGQ[i].write(⊤) and REGQ[x].read() operations are provided to the processes by the previous algorithm A. (Let us notice that
the value obtained by a read is irrelevant. As we will see, what is important is the fact that REGQ[x] has been written or not.) A
corresponding run of WRQ is denoted EQ. In that run, no process outside Q sends or receives messages related to the task WRQ.
Let us remember that C is the set of identities of the processes that are correct in the considered run. Let us observe that, as the underlying failure detector-based algorithm A that builds a register is correct, if the set Q contains all the correct processes (i.e., C ⊆ Q), EQ is such that every correct process terminates the task WRQ. In the other cases, i.e., for the tasks WRQ such that ¬(C ⊆ Q), EQ is such that a process of Q either terminates WRQ, or blocks forever, or crashes. (This depends on the actual failure pattern, the outputs of the underlying failure detector D used by the algorithm A, and the code of A. As an example, let us consider the task WRQ, and two correct processes pi and pj such that i ∈ Q and j ∉ Q. Let thi,Q be the thread of pi associated with WRQ. As j ∉ Q, the thread thj,Q does not exist. The thread thi,Q can block forever when it executes A to read or write a register of REGQ[1..n] if, due to the output of D and the code of A, it is directed to wait for a message from thj,Q, which does not exist; see footnote 1.)
Running concurrently 2^n − 1 tasks The extraction algorithm considers the 2^n − 1 distinct tasks WRQ where Q is a non-empty element of 2^Π (i.e., a non-empty subset of Π). To that end, each process pi manages 2^(n−1) threads, one for each subset Q such that i ∈ Q. Let us notice that the crash of a process pi entails the crash of all its threads.
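The two counts can be checked by enumeration. The following Python sketch (helper names are hypothetical) lists the non-empty subsets Q of Π and, for a given process, the subsets that contain it.

```python
from itertools import combinations


def nonempty_subsets(n):
    """All non-empty subsets Q of {1, ..., n}: one task WR_Q per subset."""
    procs = range(1, n + 1)
    return [
        frozenset(subset)
        for size in range(1, n + 1)
        for subset in combinations(procs, size)
    ]


def tasks_of(i, n):
    """The subsets Q with i in Q: one thread of p_i per such subset."""
    return [Q for Q in nonempty_subsets(n) if i in Q]
```

For n = 4 this gives 2^4 − 1 = 15 tasks overall and 2^3 = 8 subsets containing any given process.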
The extraction algorithm The algorithm that extracts Σ is described in Figure 3. Let us recall that the aim is to provide each process pi with a local variable Σi such that the (Σx)1≤x≤n variables satisfy the intersection and liveness properties of Σ.
To that end, each process pi manages two local variables: a set of sets of process identities, denoted quorum_setsi, and a queue denoted queuei. The aim of quorum_setsi is to contain all the sets Q such that pi terminates WRQ (task T1), while queuei is managed in such a way that eventually the correct processes appear in it before the faulty processes (tasks T2 and T3).
The idea is to select an element of quorum_setsi as the current output of Σi. As we will see in the proof, given any pair of
processes pi and pj , any quorum in quorum_setsi has a non-empty intersection with any quorum in quorum_setsj , thereby supplying
the required intersection property.
Footnote 1: A similar blocking can happen when the processes use an underlying Ω-based algorithm [5] and, in the considered run, the correct process that is eventually elected as a leader does not participate in the algorithm.
The main issue is to ensure the liveness property of Σi (eventually Σi has to contain only correct processes) while preserving the
intersection property. This is realized with the help of the local variable queuei as follows: the current output of Σi is the set (quorum)
of quorum_setsi that appears as being the “first” in queuei. The formal definition of “first element of quorum_setsi with respect
to queuei” is stated in the task T4. To make it easy to understand, let us consider the following example. Let quorum_setsi =
{{3, 4, 9}, {2, 3, 8}, {1, 2, 4, 7}}, and queuei = <4, 8, 3, 2, 7, 5, 9, 1, · · ·>. The set S = {2, 3, 8} is the first set of quorum_setsi with
respect to queuei because each of the other sets {3, 4, 9} and {1, 2, 4, 7} includes an element (9 and 7, respectively) that appears in
queuei after the elements of S. (In case several sets are “first”, any of them can be selected). The notion of first quorum is used to
ensure the liveness of Σ, i.e., the set Σi of any correct process pi eventually contains only correct processes.
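The selection of the "first" quorum can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `first_quorum` is hypothetical, the queue is truncated to the eight identities listed in the example, and ranks are counted from 1 at the head of the queue.

```python
def first_quorum(quorum_sets, queue):
    """Return a set of quorum_sets whose worst-ranked member is earliest in queue."""
    # rank[x] is the position of x in the queue (1 = head of the queue).
    rank = {pid: pos for pos, pid in enumerate(queue, start=1)}
    # Minimise, over the candidate sets, the maximal rank of their members.
    return min(quorum_sets, key=lambda q: max(rank[x] for x in q))


quorum_sets = [{3, 4, 9}, {2, 3, 8}, {1, 2, 4, 7}]
queue = [4, 8, 3, 2, 7, 5, 9, 1]  # prefix of the queue given in the example

print(sorted(first_quorum(quorum_sets, queue)))  # [2, 3, 8]
```

With this queue, the maximal ranks of the three sets are 7, 4 and 8, respectively, so {2, 3, 8} is selected. When several sets reach the minimum, `min` returns the first one encountered, which is one admissible choice among the "first" sets.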
Init: quorum_setsi ← {{1, . . . , n}}; queuei ← <1, . . . , n>;
   for each Q ∈ ($2^\Pi$ \ {∅, {1, . . . , n}}) do
      if (i ∈ Q) then launch a thread associated with the task WRQ end if
   end for.
   % Each process pi participates concurrently in all the tasks WRQ such that i ∈ Q %
Task T1: when pi terminates in the task WRQ: quorum_setsi ← quorum_setsi ∪ {Q}.
Task T2: repeat periodically broadcast ALIVE(i) end repeat.
Task T3: when ALIVE(j) is received: suppress j from queuei; enqueue j at the head of queuei.
Task T4: when pi reads Σi:
   let m = $\min_{Q \in quorum\_sets_i} (\max_{x \in Q} (rank[x]))$ where rank[x] denotes the rank of x in queuei;
   return (a set Q such that $\max_{x \in Q} (rank[x]) = m$).
Figure 3: Extracting Σ from a failure detector-based algorithm A that implements a register (code for pi)
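The effect of task T3 on queuei can be illustrated with a small simulation. This is a sketch under stated assumptions: the function name `on_alive`, the value n = 5 and the set of correct processes are all hypothetical, and message delivery is modelled as a simple loop rather than as an asynchronous network.

```python
def on_alive(queue, j):
    """Task T3: suppress j from the queue, then enqueue it at the head."""
    queue.remove(j)
    queue.insert(0, j)


n = 5
correct = {2, 4, 5}            # hypothetical set of correct processes
queue = list(range(1, n + 1))  # initial value <1, ..., n>

# Once the faulty processes have crashed, only correct processes
# keep broadcasting ALIVE messages (task T2).
for _ in range(3):
    for j in sorted(correct):
        on_alive(queue, j)

# The correct processes now occupy the first positions of the queue.
print(queue)
print(set(queue[:len(correct)]) == correct)
```

Each received ALIVE(j) moves j to the head, so the identities that stop being refreshed drift towards the tail, which is the property used by task T4.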
Remark 1 Initially quorum_setsi contains the set {1, . . . , n}. As no set of processes is ever withdrawn from quorum_setsi (task
T1), quorum_setsi is never empty. Moreover, it is not necessary to launch the task WR{1,...,n} in which all the processes participate.
This is because, as the underlying failure detector-based algorithm A (that implements a register) is correct, it follows that all the correct
processes terminate in the task WR{1,...,n}. This case is directly taken into account in the initialization of quorum_setsi (thereby
saving the execution of the task WR{1,...,n}).
Remark 2 A simple examination of the extraction algorithm shows that (1) both the variables queuei and quorum_setsi are bounded,
and (2) messages carry bounded values, from which it follows that the construction is bounded.
5.3 Minimality of MΣ
Theorem 3 Let 1 ≤ m ≤ n. MΣ is the weakest failure detector with which SM_MPn,m[∅] has to be enriched in order for an atomic register
to be built.
Proof Let D be any failure detector such that there is an algorithm A that builds an atomic register in SM_MPn,m[D]. The proof
consists in showing that, given such an algorithm A, the algorithm described in Figure 3 builds a failure detector MΣ.
Proof of the intersection property.
The proof is by contradiction. Let us first observe that the set Σi returned to a process pi is a set of quorum_setsi (that contains the set
{1, . . . , n}, its initial value, plus all the sets Q such that pi terminates WRQ). Let us assume that there are two sets Q1 and Q2 such that
(1) Q1, Q2 ∈ ⋃1≤j≤n (quorum_setsj), and (2) ∀ x, k, ℓ : (k ∈ Q1 ∧ ℓ ∈ Q2) ⇒ ({k, ℓ} ⊈ P[x]). Let us notice that the first item means
that Q1 and Q2 can be returned to some processes as their local value for Σ. The second item means that, for any cluster x, at least one of k and ℓ is not
in P[x], from which we conclude that the processes in Q1 and Q2 cannot communicate via the shared memory of any cluster P[x].
Let pi be a process that terminates WRQ1 and pj a process that terminates WRQ2 (due to the “contradiction” assumption, such
processes do exist). Using the fact that the system is asynchronous, let us construct the runs EQ1 and EQ2 associated with WRQ1 and
WRQ2 as follows. If any, the messages sent by the processes of Q1 to the processes of Q2, when they execute A to implement each
register of the array REGQ1, are delayed for an arbitrarily long period (until pi has added Q1 to quorum_setsi and pj has added Q2
to quorum_setsj). And similarly for the messages sent by the processes of Q2 to the processes of Q1 when they execute A for each
register of the array REGQ2.
Let us observe that, in the concurrent runs EQ1 and EQ2, the algorithm A, which is executed (1) only by the processes of Q1 in EQ1
to build the registers REGQ1[1..n], and (2) only by the processes of Q2 in EQ2 to build the registers REGQ2[1..n], is fed with the same
outputs of the underlying failure detector D. Since (a) pi ∈ Q1 and pj ∈ Q2, and (b) ∀ x, k, ℓ : (k ∈ Q1 ∧ ℓ ∈ Q2) ⇒ ({k, ℓ} ⊈ P[x]),
pi does not write to REGQ2[i] and pj does not write to REGQ1[j]. Thus, pi reads ⊥ from REGQ1[j], and pj reads ⊥ from REGQ2[i].
Let us construct a run EQ12 , where Q12 = Q1 ∪ Q2, that is a simple merge of EQ1 and EQ2 defined as follows. In this run, the
algorithm A (that involves only the processes in Q12 and implements the array of registers REGQ12 [1..n]) is fed with the same failure
detector outputs as the ones supplied to the concurrent runs EQ1 and EQ2 . Moreover, the messages from Q1 to Q2 and from Q2 to Q1
are delayed as in EQ1 and EQ2 . So, pi (resp., pj) receives the same messages and the same outputs from the underlying failure detector
in EQ12 and EQ1 (resp., EQ2).
• On the one side, we have the following. As the process pi receives the same messages and the same failure detector outputs in
EQ12 as in EQ1, the arrays REGQ1[1..n] and REGQ12[1..n] contain the same values. Consequently, pi reads ⊥ from REGQ12[j].
Similarly, pj reads ⊥ from REGQ12[i].
• On the other side we have the following. In EQ12 , the process pi writes ⊤ into REGQ12 [i] and the process pj writes ⊤ into
REGQ12 [j]. Moreover, one of these operations terminates before the other. Without loss of generality, let us assume that the write
by pi terminates before the write by pj . Consequently, pj reads REGQ12 [i] after it has been written. Due to the atomicity of that
register, it follows that pj obtains the value ⊤ when it reads REGQ12 [i].
The second item contradicts the first one. It follows that the initial assumption (existence of a failure detector-based algorithm A that
builds a register, Q1, Q2 ∈ ⋃1≤j≤n (quorum_setsj), and ∀ x, k, ℓ : (k ∈ Q1 ∧ ℓ ∈ Q2) ⇒ ({k, ℓ} ⊈ P[x])) is false. As A exists by hypothesis, we
conclude that at least one of the assertions Q1, Q2 ∈ ⋃1≤j≤n (quorum_setsj) and ∀ x, k, ℓ : (k ∈ Q1 ∧ ℓ ∈ Q2) ⇒ ({k, ℓ} ⊈ P[x]) is
false, which completes the proof of the intersection property of MΣ.
Proof of the liveness property.
As far as the liveness property is concerned, let us consider the task WRC (recall that C is the set of correct processes). As the underlying
failure detector-based algorithm A that implements the registers REGC[1..n] is correct (assumption), each correct process pi terminates
its REGC[i].write(⊤) and REGC[x].read() operations in EC. Consequently, in the extraction algorithm, the variable quorum_setsi of
each correct process pi eventually contains the set C.
Moreover, after some finite time, each correct process pi receives ALIVE(j) messages only from correct processes. This means that,
at each correct process pi, all the correct processes eventually precede the faulty processes in queuei. Due to the definition of “first set
of quorum_setsi with respect to queuei” stated in the task T4, it follows that, from the time C has been added to quorum_setsi, the
quorum Q selected by the task T4 is always such that Q ⊆ C, which proves the liveness property of MΣ. ✷ Theorem 3
5.4 MΣ is strictly weaker than Σ when m < n
Theorem 4 Let m < n. The model SM_MPn,m[MΣ] is strictly weaker than the model SM_MPn,m[Σ].
Proof
To prove the theorem we have to show that (a) it is possible to build MΣ in SM_MPn,m[Σ] and (b) it is impossible to build Σ in
SM_MPn,m[MΣ] when m < n.
• Proof of Item (a). For any i and τ, let us define $M\Sigma_i^{\tau} = \Sigma_i^{\tau}$. As $\Sigma_i^{\tau} \cap \Sigma_j^{\tau'} \neq \emptyset$, it follows that $\exists k \in M\Sigma_i^{\tau} \cap M\Sigma_j^{\tau'}$ and there
is trivially a partition x such that k ∈ P[x], which proves the intersection property of MΣ. The liveness property of MΣ follows
directly from its Σ counterpart.
• Proof of Item (b). The proof is by contradiction. Let us assume that there is a wait-free algorithm A that builds Σ in SM_MPn,m[MΣ]
when m < n.
As m < n, there is a partition P [x] and a pair of processes pi and pj such that i, j ∈ P [x] (i.e., pi and pj belong to the same
memory partition x). Let us consider a run in which pi and pj are correct and all the other processes crash before taking any step.
As i, j ∈ P[x], ∀ τ, τ′, $M\Sigma_i^{\tau} = \{i\}$ and $M\Sigma_j^{\tau'} = \{j\}$ are correct local outputs of MΣ (they satisfy its intersection and liveness
properties).
Let us suppose that, while it is executing A, pj pauses during an arbitrarily long but finite period during which pi runs solo and
(due to asynchrony) receives no message from pj. As ∀ τ we have $M\Sigma_i^{\tau} = \{i\}$, pi cannot distinguish this execution of A from
the one in which it is the only correct process. Hence, after some finite time, because it is wait-free, A has to output {i} at pi
for the liveness property of Σ to be satisfied. Hence, there is a time τi such that $\Sigma_i^{\tau_i} = \{i\}$.
Let us now suppose that, after time τi, pi pauses for an arbitrarily long but finite period during which pj runs solo and (due to
asynchrony) receives no message from pi. It follows from the same reasoning as before that there is a time τj at which we have
$\Sigma_j^{\tau_j} = \{j\}$.
It follows that $\Sigma_i^{\tau_i} \cap \Sigma_j^{\tau_j} = \emptyset$, and the intersection property of Σ is violated, which concludes the proof of the theorem. ✷ Theorem 4
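The gap between the two intersection properties used in this proof can be made concrete on a tiny instance. The sketch below is illustrative only: the function names and the three-process, two-cluster configuration are assumptions chosen for the example, and the MΣ predicate is written directly from the property stated in the proof (some cluster P[x] contains a process of each quorum).

```python
def sigma_intersect(q1, q2):
    """Intersection property of Sigma: the two quorums share a process."""
    return bool(q1 & q2)


def msigma_intersect(q1, q2, clusters):
    """Intersection property of MSigma: some cluster P[x] contains a
    process k of q1 and a process l of q2 (k and l may be equal)."""
    return any(bool(q1 & p) and bool(q2 & p) for p in clusters)


# Hypothetical system with n = 3 processes and m = 2 clusters (m < n).
clusters = [{1, 2}, {3}]

# Outputs {i} and {j} for two processes of the same cluster.
q_i, q_j = {1}, {2}

print(msigma_intersect(q_i, q_j, clusters))  # True: 1 and 2 share a cluster
print(sigma_intersect(q_i, q_j))             # False: the sets are disjoint
```

The pair ({1}, {2}) is acceptable for MΣ but not for Σ, which is the situation exploited in Item (b).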
Remark Let us observe from the second part of the previous proof (Item b) that, when the processes of all but one memory cluster crash,
MΣ is too weak to give information on failures. Moreover, the next corollary follows from the previous theorem when we consider the
case m = 1 (read/write shared memory model in which MΣ can be trivially implemented).
Corollary 1 Σ cannot be built from atomic registers only.
6 On the implementability of MΣ despite asynchrony and failures
When m = n (pure asynchronous message-passing system), MΣ boils down to Σ and it is known that Σ can be implemented in a
pure message-passing asynchronous system where a majority of processes are correct. Hence, the question: Is there a necessary and
sufficient condition C on n, m and a system parameter associated with failures such that MΣ can be implemented in SM_MPn,m[C]
(where SM_MPn,m[C] denotes the system model SM_MPn,m[∅] restricted to the runs where C is satisfied)? This section presents
such a necessary and sufficient condition C.
Notion of a faulty cluster Let us say that a cluster x is faulty in a run if all processes of P [x] are faulty in that run. Let t, 1 ≤ t < m
be the upper bound on the number of faulty clusters.
The next theorem shows that C = (t < m/2) is a necessary and sufficient condition to implement MΣ in SM_MPn,m[∅].
Theorem 5 Let COND be the set of all the predicates on n, m and t, C′ ∈ COND and C = (t < m/2). MΣ can be built in
SM_MPn,m[C′] if and only if C′ ⇒ C.
Proof The proof of the theorem is made up of two parts: (a) MΣ can be built when C is satisfied and (b) MΣ cannot
be built when C is not satisfied.
Proof of Item (a). The algorithm described in Figure 4 builds MΣ in SM_MPn,m[t < m/2]. Initially, each process pi initializes
MΣi to Π (the set of all process identities). Then, repeatedly, pi broadcasts a message ALIVE(i), waits until it has received messages
from processes belonging to (m − t) different clusters, and sets MΣi to the set of these processes. It is easy to show that the intersection and
liveness properties of MΣ are satisfied.
• Let us first observe that, due to the assumption t < m/2, no correct process remains blocked forever in the wait statement.
Moreover, after some finite time, a correct process receives messages only from correct processes. It follows directly from these
two observations that, after some finite time, MΣi contains only correct processes, which is the liveness property of MΣ.
• ∀ i, j ∈ Π, ∀ τ, τ′, let us consider the values of $M\Sigma_i^{\tau}$ and $M\Sigma_j^{\tau'}$. It follows from t < m/2 that $M\Sigma_i^{\tau}$ contains processes belonging
to a majority of clusters, and similarly for $M\Sigma_j^{\tau'}$. As any two majorities intersect, we conclude that there is a cluster x such that
$k \in M\Sigma_i^{\tau} \wedge \ell \in M\Sigma_j^{\tau'} \wedge \{k, \ell\} \subseteq P[x]$, which proves the intersection property of MΣ.
MΣi ← Π;
repeat forever
   broadcast ALIVE(i);
   wait until (messages received from processes in (m − t) different clusters);
   MΣi ← the set of processes from which messages have been received at previous line
end repeat.
Figure 4: Building MΣ in SM_MPn,m[t < m/2] (code of pi)
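One iteration of the loop of Figure 4 can be sketched as follows. This is a simplified model, not the authors' implementation: the function name `collect_quorum` and the dictionary `cluster_of` (mapping a process identity to its cluster index) are hypothetical, and the asynchronous reception of messages is replaced by a list of sender identities given in arrival order.

```python
def collect_quorum(senders, cluster_of, m, t):
    """Consume sender ids in arrival order until (m - t) distinct clusters
    are covered. Return the set of senders seen so far, or None if the
    wait condition is never satisfied by the given arrivals."""
    seen, clusters = set(), set()
    for j in senders:
        seen.add(j)
        clusters.add(cluster_of[j])
        if len(clusters) >= m - t:
            return seen
    return None


# Hypothetical system: m = 3 clusters, at most t = 1 faulty cluster.
cluster_of = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
m, t = 3, 1

# Messages from p1 and p2 (same cluster) do not suffice; p4 completes the wait.
print(sorted(collect_quorum([1, 2, 4, 5], cluster_of, m, t)))  # [1, 2, 4]
```

In this model, returning `None` corresponds to a process that would remain blocked in the wait statement, which the assumption t < m/2 rules out for correct processes.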
Proof of Item (b). Considering that t ≥ m/2, let us partition the set of clusters in two sets QC1 and QC2, each containing at most t clusters (i.e., QC1 ∩ QC2 = ∅ and
QC1 ∪ QC2 = ∪1≤x≤m P[x]). Due to asynchrony it is possible to delay for an arbitrarily long period all the messages from the processes
in QC1 to the processes in QC2 and all the messages from the processes in QC2 to the processes in QC1. Then, the processes in QC1
cannot distinguish the case where the processes in QC2 have crashed from the case where they are only very slow, and similarly for the processes of QC2 with
respect to the processes of QC1. The impossibility follows from this classical partitioning argument. ✷ Theorem 5
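The counting fact behind both parts of the proof (two sets of m − t clusters always share a cluster exactly when t < m/2) can be checked exhaustively for small values of m. The sketch below is only a sanity check under that reading of the argument; the function name `all_pairs_intersect` and the range of m are arbitrary choices.

```python
from itertools import combinations


def all_pairs_intersect(m, t):
    """True iff any two sets of (m - t) clusters, taken among m clusters,
    have at least one cluster in common."""
    groups = [set(c) for c in combinations(range(m), m - t)]
    return all(a & b for a in groups for b in groups)


# Compare the brute-force answer with the condition C = (t < m/2),
# for every t such that 1 <= t < m.
for m in range(2, 7):
    for t in range(1, m):
        assert all_pairs_intersect(m, t) == (t < m / 2)

print(all_pairs_intersect(5, 2), all_pairs_intersect(4, 2))  # True False
```

For m = 4 and t = 2, the groups {0, 1} and {2, 3} are disjoint, which corresponds to the partition into QC1 and QC2 used in Item (b).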
7 Conclusion
This paper has first introduced a new distributed computing model with hybrid communication. On the one side, any pair of processes
can communicate by asynchronous message-passing. On the other side, processes are partitioned into clusters and, inside each cluster,
processes can communicate through a read/write shared memory.
Then, the paper has investigated the minimal information on failures that allows an atomic register to be implemented in such a hybrid
communication model. This minimal information on failures is captured by a new failure detector denoted MΣ (which generalizes the
failure detector Σ). The paper has also presented a necessary and sufficient condition on the number of faulty shared memory clusters
that, when satisfied, allows MΣ to be implemented despite the net effect of asynchrony and failures.
References
[1] Attiya H., Efficient and Robust Sharing of Memory in Message-passing Systems. Journal of Algorithms, 34(1):109-127, 2000.
[2] Attiya H., Bar-Noy A. and Dolev D., Sharing Memory Robustly in Message Passing Systems. Journal of the ACM, 42(1):121-132, 1995.
[3] Attiya H. and Welch J., Distributed Computing: Fundamentals, Simulations and Advanced Topics (2nd Edition). Wiley-Interscience, 414 pages, 2004.
[4] Bonnet F. and Raynal M., A Simple Proof of the Necessity of the Failure Detector Σ to Implement an Atomic Register in Asynchronous Message-passing Systems. Information Processing Letters, 110(4):153-157, 2010.
[5] Chandra T., Hadzilacos V. and Toueg S., The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4):685-722, 1996.
[6] Chandra T. and Toueg S., Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225-267, 1996.
[7] Delporte-Gallet C., Fauconnier H. and Guerraoui R., Tight Failure Detection Bounds on Atomic Object Implementations. Journal of the ACM, 57(4), Article 22, 32 pages, 2010.
[8] Gifford D.K., Weighted Voting for Replicated Data. Proc. 7th ACM Symposium on Operating System Principles (SOSP’79), ACM Press, pp. 150-172, 1979.
[9] Herlihy M.P., Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, 1991.
[10] Herlihy M.P. and Wing J.L., Linearizability: a Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12(3):463-492, 1990.
[11] Lamport L., On Interprocess communication. Part I: Formalism. Part II: Algorithms. Distributed Computing, 1-2(2):87-103, 1986.
[12] Lynch N.A., Distributed Algorithms. Morgan Kaufmann Pub., San Francisco (CA), 872 pages, 1996.
[13] Raynal M., Communication and Agreement Abstractions for Fault-Tolerant Asynchronous Distributed Systems. Morgan & Claypool Publishers, 251 pages, 2010 (ISBN 978-1-60845-293-4).
