This paper presents the first (randomized) algorithm for implementing self-stabilizing group communication services in an asynchronous system. Our algorithm converges rapidly to legal behavior and is communication adaptive, namely, the communication volume is high when the system recovers from the occurrence of faults and is low once a legal state is reached. Communication adaptability is achieved by a new technique that combines transient fault detectors.
Still, it is assumed in [23] that the system is started in a certain global state and that the transitions are from a predefined set of transitions; thus, the specification and algorithm presented in [23] are not designed for self-stabilizing systems.
A different approach (part of which is randomized) is used in [30]. Every processor periodically transmits a list of the processors with which it can directly communicate. A processor is considered "up" and connected as long as it can successfully transmit a "fresh" time-stamp; otherwise, it will eventually be discarded from the system. The algorithm presented in [30] may be a base for a self-stabilizing algorithm if, for example, each processor has access to a local pulse generator, such that the maximum drift between the pulse generators is negligible. Congress [2] is an elegant protocol for registration of membership information at (hierarchically organized) servers. The hierarchy of servers improves scalability. The design fits wide area networks using virtual links to define neighboring relations. Moshe [24] is a group membership service implementation that considers an abstract network service (such as Congress [2]). Its group membership algorithm uses unbounded counters.
A self-stabilizing group membership service for synchronous systems is considered in [3]. A common periodic signal initiates a broadcast of the local topology of every processor, and each processor uses the local topologies in order to compute the connected component to which it belongs. Unbounded signal numbers are used and changes in the group are discovered following a common signal.
Our Contribution
This paper presents the first randomized algorithm for implementing a self-stabilizing group membership service in asynchronous systems. We introduce and demonstrate the communication adaptive property of a self-stabilizing algorithm. A self-stabilizing algorithm is communication adaptive if the communication volume is high when the system recovers from the occurrence of faults and is low once a legal state is reached. New self-stabilizing algorithms, such as the composite transient fault detector, are combined with known self-stabilizing algorithms, such as the update algorithm, to form the final system. The communication complexity of the composite transient fault detector is compared with a new lower bound. The resulting system achieves the membership task within a time period that is in the order of the diameter of the communication graph, which is optimal.
Our group membership service can be extended to implement different levels of broadcast services, such as single-source FIFO, totally ordered, and causally ordered. In addition, in [21] we present a scheme that uses a vertex cover heuristic for resolving history conflicts in a way that minimizes the number of processors that have to change their history, i.e., minimizes the adjustment measure [16] .
The rest of the paper is organized as follows: The system settings appear in Section 2. Our algorithms for implementing a self-stabilizing group membership service appear in Section 3. Concluding remarks are in Section 4.
THE SYSTEM
The distributed system consists of a set $P$ of communicating entities. We call each entity $p \in P$ a processor, and assume that $1 \le |P| = n \le N$, where $N$ is an upper bound on the number of processors. The processors may represent a network of real physical CPUs, or correspond to an abstract entity like a process or thread in a timesharing system. Processors are connected by communication links through which they communicate by exchanging messages. The set of processors that processor $p_i$ can directly communicate with is called $\mathit{neighbors}_i$. The communication link may represent a (real physical) communication channel device attached to the processor, a virtual link, or any interprocess communication facility (e.g., UDP and TCP).
It is convenient to represent a distributed system by a communication graph $G = (V, E)$, where each node represents a processor and each edge represents a communication link. Let $p_i, p_j \in P$; then $p_j \in \mathit{neighbors}_i$ if and only if $(p_i, p_j) \in E$.
The system is asynchronous. We assume, however, that processors eventually identify the crashed/noncrashed status of their attached links and neighbors. Sometimes, we use the term time-out in the code for a repeated action of the processor. In fact, a zero time-out period will result in the desired behavior as well. The time-out period may only reduce the number of messages sent when processors have access to a time device.
A state machine models each processor. The communication links are modeled by two antidirected FIFO queues. The system configuration is a vector of the states of the processors and the values of the queues (the messages in the queues).
A communication operation is an operation in which a message is sent or received. We also allow a processor to send the same message to every one of its neighbors in a single communication operation. An atomic step of a processor consists of internal computations that are followed by a single communication operation. A system execution is an alternating sequence of configurations and atomic steps. A fair execution is an execution in which every processor executes an infinite number of steps.
The set of legal executions includes all the executions that exhibit the desired behavior (input-output relation) of the system for a task $\tau$. For example, if $\tau$ is the mutual exclusion task, then, at most, one processor is executing the critical section in any configuration of a legal execution.
A safe configuration of the system is one from which only legal executions, with respect to $\tau$, start. An algorithm is self-stabilizing with relation to a task $\tau$ if, in every fair execution of the algorithm, there is a safe configuration with relation to $\tau$.
We use a (randomized) self-stabilizing data-link algorithm on every link [1]. The existence of the self-stabilizing data-link algorithm eventually ensures that, when a message is sent, it arrives at its destination before the next message is sent. One way to devise this kind of self-stabilizing data-link algorithm is to let the sender randomly choose a new label for a message from a big enough set of labels, such that the probability that a label of the right kind exists on the links from the sender to the receiver (a label of a message) and from the receiver to the sender (a label of acknowledgment) is low. This ensures that, eventually, the acknowledgments that the sender receives are sent after the receiver has accepted the message, and are not corrupted leftover acknowledgments [15].
The self-stabilizing data-link algorithm allows us to assume the existence of input communication buffers or communication registers instead of message passing, whenever it is convenient. One can assume that the buffers contain at most one message, and the content of an arriving message replaces the previous content of the buffer.
The program of a processor used here consists of a do-forever loop that includes a communication step with every neighboring processor. Let $R$ be an execution and let $A$ be a connected component of the system, such that no processor in $A$ is crashed during $R$. The first asynchronous cycle of $A$ in $R$ is the minimal prefix of $R$ such that each processor $p_i$ in $A$ communicates with all of its neighbors; that is, at least one message $m_j$ is sent by $p_i$ to every neighbor $p_j$, such that $p_j$ receives $m_j$ during the asynchronous cycle. The time complexity of an algorithm is measured by the number of asynchronous cycles in the algorithm execution. The convergence time is the maximal number of asynchronous cycles required to reach a safe configuration starting in an arbitrary configuration.
The number of messages sent over a particular communication link during an asynchronous cycle is a function of the number of loop iterations the attached processors execute during this asynchronous cycle (note that a processor may execute any number of iterations before another processor completes a single loop iteration). Thus, we consider a special execution to measure the communication complexity of an algorithm. A very fair execution is an execution in which every processor executes exactly a single iteration of its do-forever loop in every asynchronous cycle. The communication complexity is the total number of bits communicated over the communication links in a single asynchronous cycle of a very fair execution.
In this paper, the requirements are related to the eventual behavior of the system when the execution fulfills certain properties (unlike the requirements discussed in [11]; see also [27]). Processors may crash and recover during the execution. The neighbors of a crashed processor eventually identify the fact that it is crashed. We require that a self-stabilizing algorithm for group communication service will reach a safe configuration within a certain number of asynchronous cycles in any execution (that starts in an arbitrary initial configuration) such that each processor $p_i$ has a fixed set of noncrashed neighbors during the execution (see footnote 1). We allow simultaneous existence of several groups; we do not consider, however, interaction between the groups. We therefore choose a specific group $g$ to describe the membership service. A Boolean variable $\mathit{member}_i$ (logically) represents the intention of $p_i$ to be included in $g$. A partition of the network may cause a "partition" of $g$ as well; therefore, we associate the set of legal executions for the group membership task with the processors of a (fixed) connected component $A$, and include each execution $R$ such that the following properties hold:
1. If the value of $\mathit{member}_i = \mathit{true}$ ($\mathit{member}_i = \mathit{false}$) is fixed during $R$, then there exists a suffix of $R$ in which $p_i$ appears (does not appear, respectively) in all the views of group $g$ in the connected component $A$.
2. If the value of $\mathit{member}_i$ of every processor $p_i$ of group $g$ in the connected component $A$ is fixed during $R$, then there exists a suffix in which all the views of group $g$ in the connected component $A$ are identical; that is, the views have the same list of members and the same view identifier.
We note that the length of the prefix of $R$ that precedes the suffix mentioned in the above requirements is, for our algorithms, $\Theta(d)$ asynchronous cycles, where $d$ is the diameter of the communication graph (this is the fastest possible).
A self-stabilizing algorithm is communication adaptive if and only if the maximal communication complexity, after reaching a safe configuration, is smaller than the maximal communication complexity before reaching a safe configuration. Intuitively, the design of communication adaptive algorithms allows fast convergence by exchanging many messages following the occurrence of faults and then allows low communication overhead when the system is back to normal operation.
SELF-STABILIZING GROUP MEMBERSHIP SERVICE
In this section, we present the first communication adaptive self-stabilizing algorithm for the membership service. Roughly speaking, a spanning tree of the system is constructed. This tree is used to execute the membership management tasks. The root of the tree is responsible for the management of membership requests and for establishing new views. Several transient fault detectors monitor the consistency of the tree and the membership information.
The transient fault detectors give fast indication of the occurrence of transient faults. Once a fault is detected, the system changes the state to a safe configuration by executing a propagation of information with feedback (PIF) procedure [32], [14] several times (choosing random identifiers for these executions to ensure eventual stabilization). The update algorithm informs each processor about the nodes in its connected component. The update algorithm stabilizes rapidly, taking $\Theta(d)$ asynchronous cycles before reaching a safe configuration. Unfortunately, the communication complexity of the update algorithm is $O(|E| n \log N)$ before and after a safe configuration is reached. In this section, we present an algorithm that reduces the communication complexity to $O(|E| \log N + n^2 \log N) = O(n^2 \log N)$ once a safe configuration is reached.
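The equality between the two bounds above deserves one explicit step. The following short derivation is a sketch that assumes at most one communication link between any pair of processors (a simple graph), an assumption that is not stated explicitly in the text:

```latex
\[
|E| \le \binom{n}{2} = \frac{n(n-1)}{2} < n^2
\quad\Longrightarrow\quad
|E|\log N + n^2\log N < 2\,n^2\log N = O(n^2\log N).
\]
```

Under the same assumption, the bound $O(|E| n \log N)$ of the update algorithm can grow to $O(n^3 \log N)$ on dense graphs, which is where the saving comes from.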
A transient fault detector is composed with the update algorithm to achieve the communication adaptability property (see [4] , [6] , [5] for definitions of transient fault detectors). The transient fault detector signals every processor whether or not it needs to activate the update algorithm. Our transient fault detector itself is obtained by using a new technique for composing such detectors.
In a manner of speaking, whenever a processor detects, by use of the transient fault detector, that the update algorithm is not in a safe configuration, the processor signals the other processors in the system to start the activity of the update. The processor stops signaling the other processors to operate the update algorithm when it receives an indication that a safe configuration is reached. Fig. 1 illustrates the components of the system and their mode of operation. The right side of the figure includes the combined transient fault detectors: the detector for the existence of a distributed rooted tree, the detector for the consistency of the tree descriptions held by each processor, and the detector for the join/leave algorithm (details for the last detector can be found in [31]). The transient fault detectors examine the system configuration to ensure it is in a safe configuration. Upon fault detection (bottom arrow), the transient fault detector activates the algorithms that appear in the left side of the figure, namely, the self-stabilizing update algorithm that collects information in the fastest possible way (with high communication cost), the tree update algorithm that collects the distributed description of the rooted tree (to every processor in the system), and the fast convergence that shifts back to low gear (reducing the communication overhead) by activating the transient fault detector (upper arrow). The fast convergence ensures that the convergence algorithm finalizes the stabilization period before shifting back to the reduced communication mode of operation. Thus, the system is communication adaptive.
Footnote 1. We note that, in our time complexity calculations, we do not consider the time required to identify the status of the links and neighbors.
Self-Stabilizing Update
We use the self-stabilizing update algorithm of [13], [16]. We now sketch the main ideas used by the update algorithm. We start with the data structure used by a processor. Each processor has a list of no more than $N$ tuples $\langle \mathit{id}, \mathit{dis}, \mathit{parent} \rangle$. When the update algorithm stabilizes, it holds that the list of a processor $p_i$ contains $n$ tuples. In this list, there is exactly one tuple $\langle j, \mathit{dis}, k \rangle$ for each processor $p_j$ that is in the same connected component as $p_i$. Thus, $p_i$ knows the identity of every processor $p_j$ in its connected component, the distance $\mathit{dis}$ to $p_j$ measured by the number of hops, and a neighbor $p_k$ that is on a shortest path to $p_j$.
The processors that execute the update algorithm repeatedly receive all the tuples from the tables of their neighbors and use the values received to calculate a new table (note that the current table is not used in calculating the new table). Every time a processor $p_i$ finishes receiving the tuples of its neighbors, it acts as follows: Let $TU_i$ be the set of all tuples that a processor $p_i$ reads from its neighbors. Processor $p_i$ adds 1 to the $\mathit{dis}$ field of every tuple in $TU_i$. Then, $p_i$ adds a tuple $\langle i, 0, \mathit{nil} \rangle$ to $TU_i$. If there are several tuples with the same $\mathit{id}$ in the resulting $TU_i$, then $p_i$ removes every such tuple except a single tuple among them with the minimal $\mathit{dis}$ value. Finally, $p_i$ removes every tuple $\langle \mathit{id}, \mathit{dis}, \mathit{parent} \rangle$ such that there exists a positive $z < \mathit{dis}$ and there is no tuple with $\mathit{dis} = z$ in $TU_i$. The resulting set in $TU_i$ is the new table of $p_i$.
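The table computation just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of [13], [16]: the function name `update_step` and the tuple layout are hypothetical, received `dis` fields are assumed to be non-negative integers, and setting the parent field to the neighbor from which a tuple was read is an assumption (the sketch above does not say how the field is filled). The bound of $N$ tuples per list is not modeled.

```python
from typing import Dict, List, Optional, Tuple

# One table entry: (id, dis, parent); parent is None (nil) for the processor itself.
Entry = Tuple[int, int, Optional[int]]


def update_step(i: int, neighbor_tables: Dict[int, List[Entry]]) -> List[Entry]:
    """Compute the new table of processor i from its neighbors' tables only."""
    # Read every neighbor tuple and add 1 to its dis field. Recording the
    # sending neighbor k as the parent is an assumption of this sketch.
    tu: List[Entry] = [
        (pid, dis + 1, k)
        for k, table in sorted(neighbor_tables.items())
        for (pid, dis, _old_parent) in table
    ]
    # Add the processor's own tuple <i, 0, nil>.
    tu.append((i, 0, None))

    # Keep a single tuple per id, one with the minimal dis value.
    best: Dict[int, Tuple[int, Optional[int]]] = {}
    for pid, dis, parent in tu:
        if pid not in best or dis < best[pid][0]:
            best[pid] = (dis, parent)

    # Find the smallest positive distance z that no tuple carries. Every tuple
    # with a larger dis has a gap below it and is removed; the rest are kept.
    present = {dis for dis, _ in best.values()}
    z = 1
    while z in present:
        z += 1
    return sorted((pid, dis, parent) for pid, (dis, parent) in best.items() if dis < z)
```

Because the old table of $p_i$ is never an input, a corrupted entry survives only as long as some neighbor keeps reporting it without a distance gap, which is what lets repeated application flush arbitrary initial tables.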
Transient Fault Detectors for Reducing Communication Overhead
The communication complexity of the update algorithm is $O(|E| n \log N)$ ([13], [16]). Note that a naive approach for designing a transient fault detector is to repeatedly send $TU_i$ to every neighbor. With this approach, a fault will be detected whenever there is a change in the value of $TU_i$ (according to the update algorithm) when a message with $TU_j$ arrives from a neighboring processor $p_j$. This results in a communication complexity that is identical to the communication complexity of the update algorithm.
In this section, we present a fault detector that reduces the communication complexity of our algorithm when the algorithm stabilizes (reaches a safe configuration). The communication complexity of the algorithm, when a fault detector is used, is $O(n^2 \log N)$.
The update algorithm informs each processor about the nodes in its connected component. The task of the transient fault detector is to detect a fault whenever there exists at least one processor that does not recognize the set of processors in its connected component.
We present a new scheme for reducing communication complexity by combining two transient fault detectors. The first fault detector communicates short messages over all the links of the system and ensures that there is a marked rooted spanning tree. The short messages consist of the identifier of the common leader and the distance of the processor from this leader. The second transient fault detector assumes the existence of a spanning tree and communicates larger messages over the links of this tree. In fact, these messages consist of the description of the rooted spanning tree.
Transient Fault Detector for the Existence of a Rooted Tree
The code for the first part of the transient fault detector appears in Fig. 2. In the code, we use the input $\langle \mathit{leader}_i, \mathit{dis}_i, \mathit{parent}_i \rangle$, which is defined by the output of the update algorithm. Let $\langle l, d, p \rangle$ be the tuple in $TU_i$ such that $l$ is the maximal value among the values of the leader variables in $TU_i$. The value of $\langle \mathit{leader}_i, \mathit{dis}_i, \mathit{parent}_i \rangle$ is assigned the values of $\langle l, d, p \rangle$. A change in the value of $\langle \mathit{leader}_i, \mathit{dis}_i, \mathit{parent}_i \rangle$, as well as in the $\mathit{neighbors}_i$ set, triggers fault detection. Lines 1 and 1a of the code ensure that information for detection of a fault is sent by every processor to its neighbors once every time-out period. Line 1b ensures that the processor for which $\mathit{leader}_i = i$ has the value 0 in its $\mathit{dis}_i$ variable and the value $\mathit{nil}$ in its $\mathit{parent}_i$ variable. Line 2a ensures that all the processors have the same value in their leader variable, and that the distance of the parent of each (nonleader) processor $p_i$ is one less than the distance of $p_i$ from the leader.
The directed graph $T = (V, E)$ is defined as follows: Each node of the graph $V$ represents a processor in the system (and vice versa). There exists a directed edge $(i, j) \in E$ if and only if the value of the parent field of the processor $p_i$, in the tuple of $TU_i$ with the maximal id, is $p_j$.
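Since Fig. 2 is not reproduced here, the checks attributed to lines 1b and 2a can be illustrated with the following Python sketch. It is a hypothetical reconstruction from the prose only: the names `TreeState`, `local_fault`, and `receive_fault` are not from the paper, `None` stands for nil, and the periodic sending of lines 1 and 1a as well as the detection triggered by a change of the input are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TreeState:
    """The triple <leader_i, dis_i, parent_i> held (and sent) by a processor."""
    leader: int
    dis: int
    parent: Optional[int]  # None plays the role of nil


def local_fault(i: int, mine: TreeState) -> bool:
    """Check described for line 1b: a processor that regards itself as the
    leader must have dis = 0 and parent = nil."""
    return mine.leader == i and (mine.dis != 0 or mine.parent is not None)


def receive_fault(mine: TreeState, j: int, theirs: TreeState) -> bool:
    """Check described for line 2a, run when the triple of neighbor j arrives."""
    # All processors must agree on the leader.
    if theirs.leader != mine.leader:
        return True
    # The parent must be exactly one hop closer to the leader.
    if mine.parent == j and theirs.dis != mine.dis - 1:
        return True
    return False
```

Each message in this sketch carries three identifiers or counters, so its size is $O(\log N)$ bits, which matches the short messages the first detector is said to exchange over all links.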
Correctness Proof of the First Transient Fault Detector
To prove the correctness of the fault detector, we show that if no processor detects faults during an asynchronous cycle, then T is an in-tree (as we now define), which is rooted at the common leader (the processor with the maximal identifier).
Definition 1.
A directed graph is an in-tree if the undirected underlying graph is a tree, and if every edge of the tree is directed toward a common root.
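Definition 1 can be made concrete with a small checker. The sketch below assumes the graph is given as a parent map (each node maps to its parent, and the root maps to `None`), so every node has at most one outgoing edge, as in the graph $T$ induced by the parent fields; the function name `is_in_tree` is hypothetical.

```python
from typing import Dict, Optional


def is_in_tree(parent: Dict[int, Optional[int]]) -> bool:
    """True if the edges (v, parent[v]) form an in-tree over the keys of `parent`."""
    # Exactly one node may lack a parent: the common root.
    roots = [v for v, p in parent.items() if p is None]
    if len(roots) != 1:
        return False
    root = roots[0]
    # Following parent pointers from every node must reach the root without
    # revisiting a node (a cycle) and without leaving the node set.
    for start in parent:
        seen = set()
        v = start
        while v != root:
            if v in seen or v not in parent:
                return False
            seen.add(v)
            v = parent[v]
    return True
```

With one outgoing edge per non-root node, reaching the root from every node implies that the underlying undirected graph is connected with $n - 1$ edges, that is, a tree whose edges all point toward the root.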
Lemma 1. If no fault is detected during the execution of one asynchronous cycle of the algorithm in Fig. 2 , then all the processors that belong to the same connected component have the same value in their leader field.
Proof. A message with the current leader is repeatedly sent to all neighboring processors (line 1a). By line 2a, neighbors agree on the identity of the leader. Suppose, by way of contradiction, that a fault is not detected, but there are still two processors in the same connected component that have different values in their leader fields. Let $p_i$ and $p_j$ be two such processors that are at the minimal distance (among all the choices of two processors with different values in their leader fields). Clearly, if no fault is detected, it holds that the distance between $p_i$ and $p_j$ must be greater than one. Therefore, there must exist another processor, $p_k$, that is on a minimal path from $p_i$ to $p_j$. The leader of $p_k$ must be different from the leader of $p_i$ and/or the leader of $p_j$, since the leaders of $p_i$ and $p_j$ are different. Hence, the path between $p_k$ and $p_i$, or the path between $p_k$ and $p_j$, is shorter than the path between $p_j$ and $p_i$, a contradiction. □
Lemma 1 proves that, if two processors with different leaders exist, then a fault is detected within a single asynchronous cycle.
Lemma 2. If no fault is detected during the execution of one asynchronous cycle of the algorithm in Fig. 2, then the value of the parent field of every processor $p_i$ is either the identity of a neighbor, or is $\mathit{nil}$ when the leader is $p_i$.
Proof. The lemma is implied by the definition of the values that can be stored in a parent field and by line 1b of Fig. 2. □
We define the directed graph $T = (V, E)$ as follows: Each node of the graph $V$ represents a processor in the system (and vice versa). There exists a directed edge $(i, j) \in E$ if and only if the value of the parent field of the processor $p_i$ is $p_j$.
Lemma 3. If no fault is detected during the execution of one asynchronous cycle of the algorithm in Fig. 2, then there is no cycle in $T$.
Proof. Suppose, by way of contradiction, that no fault is detected when there exists a cycle $C = p_1, p_2, \ldots, p_l$ in $T$. Let $d_i$ be the value of the $\mathit{dis}$ field for every processor $p_i \in C$. Without loss of generality, let $d_1$ be the minimal value among these $d_i$ values. By line 2a of the code of the algorithm and the definition of R, it holds that $d_l$ must be smaller than $d_1$, contradicting our choice of $d_1$. □
Lemma 4. If no fault is detected during the execution of one asynchronous cycle of the algorithm in Fig. 2, then there exists exactly one leader in the system.
Proof. By the fact that every processor has a distinct identifier and by Lemma 1, it is not possible that more than one leader will exist when faults are not detected during the execution of an asynchronous cycle. Thus, we only have to consider the case in which there is no processor that is a leader and a fault is not detected.
Recall that a change in the input triggers fault detection as well. Let $d_i$ be the value of the $\mathit{dis}_i$ field for every processor $p_i$ in the system, and let $d_m$ be the minimal value among these $d_i$ values. A processor $p_m$ with distance $d_m$ that has a parent $p_l$ will detect a fault when executing line 2a of the code. □
Lemmas 1 and 3 imply the next corollary.
Corollary 1. If a fault is not detected during the execution of one asynchronous cycle of the algorithm in Fig. 2, then $T$ is a spanning in-tree of $G$ rooted at the leader.
Next we show that, in the case that a spanning in-tree exists in the system and, in addition, the variables of the fault detector are not corrupted, then no fault is detected.
Lemma 5. Let R be an execution of the algorithm in Fig. 2 , in which it holds in every configuration of R that:
3. The value of the leader variable of every processor is $l$.
4. $\mathit{dis}_j$ is equal to the number of edges connecting $p_j$ to $p_l$ in $T$.
No fault is detected in $R$.
Proof. A fault is detected only in lines 1b or 2a. We now show that, in case the assumptions of the lemma hold, no fault is detected by $p_i$ in any execution of lines 1b and 2a. By assertions 1 and 2 of the lemma, it must hold that $\mathit{parent}_l = \mathit{nil}$. By assertion 4, it holds that $\mathit{dis}_l = 0$; thus, $p_l$ does not detect a fault when executing line 1b. By assertion 3, any other processor $p_i$ ($i \ne l$) does not detect a fault in line 1b, since $\mathit{leader}_i \ne i$.
To show that no fault is detected in line 2a, we use assertions 3 and 4. A message received from the parent in $T$ must have a distance smaller by one than the value of $\mathit{dis}_i$. Hence, the lemma is proven. □
The last issue to address for this fault detector is the communication complexity of our solution.
Lemma 6. The communication complexity of the transient fault detector algorithm presented in Fig. 2 is $O(|E| \log N)$ bits.
Proof. The messages sent by the algorithm are of the form $\langle \mathit{leader}_i, \mathit{dis}_i, \mathit{parent}_i \rangle$. Thus, the number of bits in each message is $O(\log N)$. Since a message is sent periodically on each link, it holds that the communication complexity is $O(|E| \log N)$. □
The Tree Update Algorithm
Before we continue with the transient fault detector, let us add a mechanism to distribute the description of $T$ to every processor in the system. We augment each processor $p_i$ with a variable $T_i$ that should contain the description of $T$. Let $T_i(p_j)$ be the component of $T_i$ that is connected to $p_i$ when the link from $p_i$ to $p_j$ is removed from $T_i$. The processor $p_i$ repeatedly sends $T_i(p_j)$ to every processor $p_j \in (\{\mathit{parent}_i\} \cup \mathit{children}_i)$. The value of the $\mathit{parent}_i$ variable is defined by the value $p_j$ of the tuple $\langle l, d, p_j \rangle$ in $TU_i$ such that $l = \mathit{leader}_i$. The $\mathit{children}_i$ set includes every neighbor $p_j$ from which the last table $TU_j$ was received and includes a tuple $\langle l, d, p_i \rangle$, where $l = \mathit{leader}_i$. The processor $p_i$ repeatedly computes $T_i$ using the last values of $T_j(p_i)$ received from every processor $p_j \in (\{\mathit{parent}_i\} \cup \mathit{children}_i)$. The processor $p_i$ constructs $T_i$ from the above $T_j(p_i)$, while adding the links that connect itself to the processors in $(\{\mathit{parent}_i\} \cup \mathit{children}_i)$.
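The two operations of the tree update, cutting out the part $T_i(p_j)$ to send and reassembling $T_i$ from the received parts, can be sketched as follows. This is a minimal illustration under stated assumptions: a tree description is represented as a set of directed (child, parent) edges, and the names `component_without_link` and `rebuild_tree` are hypothetical, not taken from the paper.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Edge = Tuple[int, int]  # (child, parent): directed toward the root


def component_without_link(tree: Set[Edge], i: int, j: int) -> Set[Edge]:
    """T_i(p_j): the edges of the part of `tree` that stays connected to i
    once the link between i and j is removed."""
    remaining = {e for e in tree if set(e) != {i, j}}
    reached = {i}
    frontier = [i]
    while frontier:
        v = frontier.pop()
        for a, b in remaining:
            # Treat each edge as undirected while exploring the component.
            for x, y in ((a, b), (b, a)):
                if x == v and y not in reached:
                    reached.add(y)
                    frontier.append(y)
    return {e for e in remaining if e[0] in reached}


def rebuild_tree(i: int, parent_i: Optional[int], children_i: Iterable[int],
                 received: Dict[int, Set[Edge]]) -> Set[Edge]:
    """Construct T_i from the parts T_j(p_i) last received from the parent and
    the children, adding the links between i and those tree neighbors."""
    children = set(children_i)
    tree_neighbors = children | ({parent_i} if parent_i is not None else set())
    edges: Set[Edge] = set()
    for j in tree_neighbors:
        edges |= received.get(j, set())
    if parent_i is not None:
        edges.add((i, parent_i))
    for c in children:
        edges.add((c, i))
    return edges
```

For example, in the tree with edges (2, 1), (3, 1), (4, 2), processor 2 receives {(3, 1)} from its parent 1 and the empty set from its child 4; adding its own two links yields the full description.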
Correctness Proof of the Tree Update Algorithm
We now prove the correctness of the tree update algorithm. In the proof, we consider an execution that starts in a safe configuration of the update algorithm and prove correctness of the tree update in such executions. A safe configuration of the update algorithm is a configuration in which the values of the tuples of all the processors are correct (and, therefore, are not changed in any execution that starts in such a safe configuration).
In the lemma, we use the term height of a processor $p_i$ in an in-tree for the maximal number of edges in a path from a leaf in the tree to $p_i$, such that the path does not include the root of the tree.
Lemma 7. Consider any execution $R$ of the tree update algorithm that starts in a safe configuration of the update algorithm and consists of at least $h + 1$ asynchronous cycles. Let $p_j$ be a processor such that $T(p_j)$ is a sub-in-tree of $T$ (the in-tree as defined by the update algorithm) that is rooted at $p_j$, and the height of $p_j$ is, at most, $h$. Let $T_j(p_j)$ be the description of the tree rooted at $p_j$ in the variable $T_j$ of $p_j$. It holds that $T(p_j) = T_j(p_j)$ in the last configuration of $R$.
Proof. The lemma is proved by induction on h.
Base case $h = 1$: For the case $h = 1$, the height of $T(p_j)$ is zero; therefore, $\mathit{children}_j$ is an empty set. Thus, $p_j$ sets $T_j(p_j)$ to a single node during the first asynchronous cycle of $R$.
Induction step: Let $p_j$ be a processor such that $T(p_j)$ is at the height of, at most, $h$. By the induction assumption, $T(p_j) = T_j(p_j)$ holds in any configuration of $R$ that follows the first $h + 1$ asynchronous cycles. Thus, during the asynchronous cycle that follows the first $h + 1$ asynchronous cycles, it holds that any processor $p_k$ of height $h + 1$ receives the correct trees from its children and, therefore, constructs the correct tree of height $h + 1$. □
We say that a configuration, $c$, is safe with relation to the tree update algorithm if and only if $c$ is safe for the update algorithm and, for every processor $p_i$, $T_i = T$. Moreover, in any execution that starts in $c$, the value of $T_i$ is not changed (this last requirement implies, in fact, that any message in transit from $p_i$ to $p_j$ contains $T_i(p_j)$, which is the portion of $T$ connected to $p_i$ when the link from $p_i$ to $p_j$ is removed).
Corollary 2. The tree update algorithm reaches a safe configuration following the first $O(d)$ asynchronous cycles and its communication complexity is $O(n^2 \log N)$.
Proof. The update algorithm constructs a rooted BFS tree such that the height of the root is no more than $d$. By Lemma 7, it holds that, in any configuration $c$ which follows the first $d + 1$ asynchronous cycles, $T_i = T$ for every processor $p_i$. A processor $p_i$ that executes the tree update algorithm sends messages with $T_i$ to $\mathit{parent}_i$ and $\mathit{children}_i$. The number of bits in each message is $O(n \log N)$. Since a message is sent periodically only through the links of $T$, it holds that the communication complexity is $O(n^2 \log N)$. □
Transient Fault Detector for Correct Description of the Tree
The second transient fault detector assumes the existence of a rooted spanning tree $T$, which is defined by the child-parent relation, and ensures that every processor $p_i$ has the description of $T$ in $T_i$. Thus, the second transient fault detector ensures that every processor knows the set of processors in its connected component. Let us first describe the consistency test function in Fig. 3 that is used by our transient fault detector. In the code, we use the input $T_i$, $\mathit{parent}_i$, $\mathit{children}_i$, which is defined by the output of the tree update algorithm. The consistency test function uses a Boolean variable $\mathit{consistent}$. First, $p_i$ assigns true to the $\mathit{consistent}$ variable (line 1 of Fig. 3). In line 2, $p_i$ checks that $T_i$ is a spanning in-tree. Lines 3, 4, and 5 test that the child/parent relations of $p_i$ (according to the update algorithm) are correct in $T_i$. The function returns the final value of $\mathit{consistent}$.
The transient fault detector is presented in Fig. 4. The fault detector will ensure that all local values of $T$ are identical and that the local tree neighborhood of every processor appears in $T$. In the code, we use the input $T_i$, $\mathit{parent}_i$, $\mathit{children}_i$, which is defined by the output of the tree update algorithm (see the description of the code of Fig. 3 for the values of the above inputs).
The processor $p_i$ repeatedly executes lines 1a and 1b. In line 1a, $p_i$ sends $T_i$ to its parent and children. Then, $p_i$ checks the consistency of $T_i$ according to the consistency test described in Fig. 3, and detects a fault accordingly. Whenever $p_i$ receives $T_j$ from $p_j$, $p_i$ checks whether $T_i = T_j$, and detects a fault if this equation is not true (line 2a of the code).
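As Figs. 3 and 4 are not reproduced here, the following Python sketch illustrates the consistency test and the comparison on receipt as they are described in the prose. It is a hypothetical reconstruction: $T_i$ is assumed to be stored as a map from each node to its parent (the root maps to `None`), "spanning" is interpreted as an in-tree over the nodes of $T_i$ that contains $p_i$, and the names `consistent` and `fault_on_receive` are not from the paper. The periodic sending of line 1a is not modeled.

```python
from typing import Dict, Iterable, Optional

Tree = Dict[int, Optional[int]]  # node -> parent; the root maps to None


def _is_in_tree(t: Tree) -> bool:
    """Single root, and every node reaches it by following parent pointers."""
    roots = [v for v, p in t.items() if p is None]
    if len(roots) != 1:
        return False
    for start in t:
        v, steps = start, 0
        while t.get(v) is not None:
            v, steps = t[v], steps + 1
            if steps > len(t):  # more steps than nodes: a cycle
                return False
        if v != roots[0]:
            return False
    return True


def consistent(i: int, t_i: Tree, parent_i: Optional[int],
               children_i: Iterable[int]) -> bool:
    """Consistency test in the spirit of Fig. 3; returns the final flag."""
    ok = True
    # T_i must be an in-tree that contains p_i.
    if i not in t_i or not _is_in_tree(t_i):
        ok = False
    # The parent of p_i recorded in T_i must match parent_i.
    if t_i.get(i) != parent_i:
        ok = False
    # The children of p_i recorded in T_i must match children_i.
    if {c for c, p in t_i.items() if p == i} != set(children_i):
        ok = False
    return ok


def fault_on_receive(t_i: Tree, t_j: Tree) -> bool:
    """Check described for line 2a of Fig. 4: the two descriptions must be equal."""
    return t_i != t_j
```

A processor in this sketch detects a fault when `consistent` returns false or when `fault_on_receive` returns true for a description received from its parent or one of its children.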
Correctness Proof of the Second Transient Fault Detector
We are now ready to prove the correctness of the second fault detector, assuming that no fault is detected by the first transient fault detector. In other words, we must assume an execution in which assertions 1 through 4 of Lemma 5 hold.
Lemma 8. If no fault is detected during the execution of one asynchronous cycle of the algorithm in Fig. 4, then it holds that $T_i = T_j$ for every $p_i$ and $p_j$ that are in the same connected component.
Proof. Suppose, by way of contradiction, that a fault is not detected, but there are still two processors in the same connected component that have different values in their $T$ variables. The proof is completed by arguments similar to the arguments presented in the proof of Lemma 1. □
Lemma 9. If no fault is detected during an execution $R$ of one or more asynchronous cycles of the algorithm in Fig. 4, then in the last configuration of $R$: The existence of an edge from $p_j$ to $p_k$ in $T_i$ implies that $\mathit{parent}_j = p_k$, and the equality $\mathit{parent}_l = p_m$ implies the existence of an edge from $p_l$ to $p_m$ in $T_i$.
Proof. Suppose that an edge from $p_j$ to $p_k$ is in $T_i$. By Lemma 8, $T_i = T_j$. Thus, since $p_j$ does not detect a fault while executing line 1b, it must hold that $\mathit{parent}_j = p_k$. Suppose that $\mathit{parent}_l = p_m$. By Lemma 8, it holds that $T_i = T_l$. Thus, since $p_l$ does not detect a fault while executing line 1b, it holds that $\mathit{parent}_l = p_m$ in $T_l$ and, therefore, in $T_i$ as well. □
The next corollary can be easily concluded from the above two lemmas.
Corollary 3. If no fault is detected during an execution R of one or more asynchronous cycles of the algorithm in Fig. 4 , then in the last configuration of R, it holds for every p i that T i ¼ T .
We now turn to prove that no fault is detected when the system is in a safe configuration for the tree update algorithm.
Lemma 10. Let $R$ be an execution of the algorithm in Fig. 4 in which the following assertions hold in every configuration: assertions 1 to 4 of Lemma 5, and (as assertion 5) the value of every $T_i$ variable is identical to $T$. Then no fault is detected in $R$.
Proof. A fault is detected only in lines 1b or 2a of Fig. 4. We now show that, in case the assumptions of the lemma hold, no fault is detected by $p_i$ in any execution of lines 1b and 2a. A fault is detected in line 1b only if $T_i$ is inconsistent according to the consistency test. We, therefore, have to show that, in case the assumptions of the lemma hold, the consistency test (presented in Fig. 3) of $T_i$ returns true. By assertions 1 and 5 of the lemma, the conditions in lines 2, 3, and 4 of Fig. 3 are false and, therefore, a true value is returned by the function. By assertion 5 of the lemma, no fault is detected in any execution of line 2a. □
Lemma 11. The communication complexity of the transient fault detector algorithm presented in Fig. 4 is $O(n^2 \log N)$ bits.
Proof. A processor $p_i$ that executes the second transient fault detector sends messages with $T_i$ to $parent_i$ and $children_i$. Thus, the number of bits in each message is $O(n \log N)$. Since a message is sent periodically only on the links of $T$, the communication complexity is $O(n^2 \log N)$. □
Last, we combine both fault detectors, where the first fault detector messages are augmented with the second fault detector messages. (Note that the second fault detector sends messages only on tree links. The messages sent by the first fault detector on nontree links are not augmented by a message of the second fault detector.) We conclude the presentation and correctness proof of the transient fault detectors by the following corollary.
Lower Bound on the Communication Complexity
We now present a lower bound of $\Omega(n^2 \log(N/n - 1))$ bits on the communication complexity. The lower bound is for any fault detector that detects a fault within a single asynchronous cycle (whenever a processor has inconsistent knowledge of the set of processors in its connected component or view). Recall that group membership services notify the application of the current view. To do so, we consider an asynchronous cycle that starts with all processors sending messages to every one of their neighbors (where a processor can send nil messages in case no message should be sent to a neighbor), and the cycle terminates after all messages sent are received. We examine processors $p_1, p_2, \ldots, p_n$, which are connected by a chain communication graph. Assume that $n$ is even (a similar argument can be used for a chain with an odd number of processors) and assume that $N \ge 2n$. Let $m_{k,k+1}$ ($m_{k+1,k}$) be the message sent from $p_k$ to $p_{k+1}$ (from $p_{k+1}$ to $p_k$, respectively). We claim that distinguishing the possible combinations of $m_{k,k+1}, m_{k+1,k}$ requires $\Omega(n \log(N/n - 1))$ bits. Let $p_k$ be a processor in the chain and suppose that $k \le n/2$. Fix a set of $k$ distinct identifiers for the processors $p_1, p_2, \ldots, p_k$. We prove a lower bound by using the number of possible choices of different sets of $n - k$ distinct identifiers for the rest of the processors $p_{k+1}, p_{k+2}, \ldots, p_n$.
Let $X_1$ and $X_2$ be two such choices. Now, we describe two different systems that differ in the way we assign identifiers to processors $p_{k+1}, p_{k+2}, \ldots, p_n$. The identifiers of these processors in the first (second) system are the identifiers in $X_1$ ($X_2$, respectively). Clearly, the communication over the edge connecting $p_k$ to $p_{k+1}$ must not be the same in the two systems; otherwise, we may swap the two differing portions of the two systems and no fault will be detected, while $p_1$ will not be aware of the different set of processors in the system. The case of $k > n/2$ is handled analogously, fixing a set of $k$ distinct identifiers for the processors $p_{n-k}, p_{n-k+1}, \ldots, p_n$. In both cases, we conclude that the number of communication patterns needed is at least the number of choices of $n - k$ distinct identifiers for the processors $p_{k+1}, p_{k+2}, \ldots, p_n$ out of $N - k$ identifiers:
$$\frac{(N-k)!}{(n-k)!\,((N-k)-(n-k))!} = \frac{(N-k)!}{(n-k)!\,(N-n)!} = \frac{(N-n+1)\cdots(N-k)}{(n-k)!} \ge \left(\frac{N-n+1}{n-k}\right)^{n-k}.$$
We assume that $N \ge 2n$; thus, we have $(N-n+1)/(n-k) \ge 1$. By the assumption that $1 \le k \le n/2$, we have
$$\left(\frac{N-n+1}{n-k}\right)^{n-k} \ge \left(\frac{N-n}{n}\right)^{n/2}.$$
Therefore, at least $\Omega(n \log(N/n - 1))$ bits are needed for the messages $m_{k,k+1}$ and $m_{k+1,k}$. The communication complexity is a measure that considers all the links and is, therefore, $\Omega(n^2 \log(N/n - 1))$ bits.
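The chain of inequalities in the counting argument can be checked numerically for small parameters. The following Python sketch is only a sanity check under the stated assumptions ($n$ even, $N \ge 2n$, $1 \le k \le n/2$); the function names are hypothetical and the floating-point comparison is adequate only for moderate values of $N$ and $n$.

```python
import math


def distinct_patterns(N, n, k):
    """Number of ways to choose the n-k identifiers of p_{k+1..n} out of N-k."""
    return math.comb(N - k, n - k)


def lower_bound_bits(N, n):
    """log2 of the weakest bound ((N-n)/n)^(n/2) from the derivation."""
    return (n / 2) * math.log2((N - n) / n)


def check(N, n):
    """Verify exact count >= middle bound >= weakest bound for every 1 <= k <= n/2."""
    assert n % 2 == 0 and N >= 2 * n
    for k in range(1, n // 2 + 1):
        exact = distinct_patterns(N, n, k)
        middle = ((N - n + 1) / (n - k)) ** (n - k)
        weakest = ((N - n) / n) ** (n / 2)
        if not (exact >= middle >= weakest):
            return False
    return True
```

For example, `check(64, 8)` exercises every admissible $k$ for a chain of eight processors with 64 possible identifiers, and `lower_bound_bits` gives the per-link bit count implied by the weakest bound.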
Group Membership and Voluntarily Join/Leave
In a legal execution, only the user is privileged to change his/her membership status in a group. Such a change occurs in response to an application request. Here, we describe how, in a legal execution, processor $p_i$ may join (leave) a group $g$ by locally setting (resetting, respectively) $member_i$.
We use the self-stabilizing $\beta$-synchronizer algorithm [15] to coordinate view updates and to detect transient faults.
The $\beta$-synchronizer is designed to be executed on a spanning tree of the system (in our case, $T$). There are two alternating phases of the $\beta$-synchronizer: propagation and convergecast. Every (successful) pair of consecutive propagation and convergecast phases is associated with a color chosen by the root, which monitors the consistency of membership requests and updates.
In a legal execution, processor $p_r$ (the root of $T$) is responsible for the membership updates. During the propagation phase, $p_r$ propagates the view it maintains, $v_r$, together with its associated color. As $v_r$ propagates throughout $T$, every processor $p_i$ assigns $v_r$ to a local variable $v_i$ that maintains its view. The convergecast phase is intended to deliver feedback on the termination of the propagation phase and to gather view-update requests from the application.
Processor $p_i$ updates the value of $member_i$ according to the requests of the user at its site. For the sake of simplicity, we assume that $member_i$ is updated just before $p_i$ reports to its parent on the completion of the convergecast phase. The values of $member_i$ are accumulated from every node in $T$. A leaf $p_l$ in $T$ delivers the value of $member_l$ to its parent. A parent $p_k$ of a leaf processor concatenates the values of $member_c$ received from its children $p_c$ together with $member_k$ and delivers the result to its parent, and so on. Once the convergecast phase terminates, the root sends the received concatenated information on the membership of all the processors, together with a view identifier (the view identifier is changed whenever the set of members is changed).
Processor $p_i$ accumulates the membership requests in the variable $request_i$. We note that $request_i[i]$ is identical to $member_i$. The entry $request_i[j]$ is reserved for the value of $request_j$, where $p_j \in children_i$. The variable $request_i$ is an array of bits, one for every processor in the subtree of $T$ that is rooted at $p_i$ (the $k$th bit in the array corresponds to the $k$th processor in a preorder traversal of this subtree). The view $v_i$ maintained by $p_i$ is the tuple $\langle id, members \rangle$, where $id \in ViewIDs$ and $members$ is an array of bits with a structure that is identical to that of $request_r$ (where $p_r$ is the root of $T$).
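The accumulation of membership bits during the convergecast phase can be sketched as a recursive preorder concatenation. The Python fragment below is an illustration only: the dictionary-based tree, the function names, and the use of integers as view identifiers are assumptions and are not taken from the paper's code.

```python
def preorder(children, node):
    """Processors of the subtree rooted at node, in preorder."""
    order = [node]
    for child in children.get(node, []):
        order.extend(preorder(children, child))
    return order


def convergecast(children, member, node):
    """request_node: membership bits of node's subtree, in preorder."""
    bits = [member[node]]  # the first entry plays the role of request_i[i] = member_i
    for child in children.get(node, []):
        bits.extend(convergecast(children, member, child))
    return bits


def next_view(view, members):
    """Change the view identifier only when the set of members changes."""
    view_id, old_members = view
    if members == old_members:
        return view
    return (view_id + 1, members)  # integer view ids are an assumption
```

With `children = {"r": ["a", "b"], "a": ["c"]}`, the preorder is `r, a, c, b`, so the $k$th bit of the root's array describes the $k$th processor of that list.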
The $\beta$-synchronizer algorithm uses color variables. We use the color variables in order to control the flow of the membership information. Every processor $p_i$ uses the variables $DC_i \in Colors$ (down color) and $UC_i \in Colors$ (up color), where $Colors$ is a finite set of integers. (Processor $p_i$ uses $U_i[j]$ to store the value of $\langle DC_j, UC_j, v_j, request_j \rangle$, where $p_j \in children_i$.)
Roughly speaking, the root of $T$ repeatedly colors $T$ with a new color. Suppose that $p_l$ is a leaf node in $T$; then, the new color propagates "downward" and "upward" until it replaces the old color on the path from the root of $T$ to $p_l$. When propagating "downward," the new color is assigned to $DC_i$ by every node $p_i$. A leaf node $p_l$ in $T$ assigns the value of $DC_l$ to $UC_l$. The convergecast phase proceeds from the leaves "upward" to the root, replacing the color in $UC_i$ of every node $p_i$ in $T$ with the new color.
Due to length restrictions, the detailed description of the coloring procedure and the fitting transient fault detector are omitted from this version. The reader may find these details in [31], where it is proved that a fault is detected within $O(1)$ asynchronous cycles.
Before we turn to describe the actions taken upon a fault detection, let us note that randomized transient failure detectors can be used as well. In a legal execution, our deterministic transient failure detector repeatedly sends the same message through each link. Thus, the randomized technique proposed in [25] , which uses a logarithmic size of the repeatedly sent message, can be used here to further reduce the size of the messages sent. In such a case, the failure detectors will detect a fault with high probability.
Fast Convergence
So far, we have discussed transient fault detection without describing the action taken when a fault is detected. The goal of the technique presented here is to ensure fast convergence at the cost of a higher communication complexity. Once the transient fault detector detects a fault, we would like to activate the self-stabilizing tree update algorithm to regain consistency as soon as possible and then switch back to using transient fault detectors.
Propagation of a Fault Detection
Once a processor $p_i$ detects a fault, it propagates the fault indication to every other processor. Every tuple of the update tables is extended to include a state field, where the domain of the state is {safe, dtc, act}. We use the term the source tuple of $p_i$ for the single tuple of $TU_i$ in which $i$ is in the id field. Then, $p_i$ starts the propagation by assigning the values $\langle i, 0, nil, dtc \rangle$ to its source tuple. In the sequel, we refer to the fourth field of $p_i$'s source tuple as the state of $p_i$.
Every processor $p_j$ that has at least one state field in a tuple of $TU_j$ with a value not equal to safe executes the update algorithm, sending messages through every attached link. When $p_i$ sends the new value of $TU_i$ to $p_j$, and the state of $p_j$ is safe, $p_j$ changes its state to dtc. The information on the fact that $p_i$ detected a fault propagates to the entire system in the same way.
Our goal is to ensure that every processor $p_k$ verifies that the tuples in the tables of the processors encode a fixed BFS tree rooted at $p_k$ and, therefore, that the update algorithm is in a safe configuration. We then allow the system to switch back to using the transient fault detector.
A central tool in achieving an indication of the completion of the reconstruction of the BFS trees is the PIF procedure. The propagation is done by flooding the system with the new information in the way we described above (for the case of dtc). The propagating processor, $p_i$, should receive feedback upon the completion of the propagation before finalizing the PIF procedure. Each processor sends its feedback to a neighbor whose distance to $p_i$ is smaller than its own, which it selects to be its parent in the tree. Every processor $p_j$ uses the distance variable of the tuple with $id = i$ in $TU_j$ as its (upper bound on the) distance to $p_i$. A processor $p_j$ sends feedback only when the maximal difference between the distance of $p_j$ to $p_i$ and the distance of any neighbor $p_k$ of $p_j$ to $p_i$ is 1. The fact that the value in the distance fields is an upper bound on the distance from $p_i$ guarantees that every neighbor $p_j$ of $p_i$ sends feedback when the value of its distance field is 1 and, therefore, has a fixed parent (namely, $p_i$). Moreover, $p_j$ sends feedback only when every one of its neighbors has a distance of at most two. Thus, processors of distance two have the correct distance and, therefore, a fixed parent. Similar arguments hold for processors of greater distances, concluding that a fixed BFS tree rooted at $p_i$ exists when $p_i$ receives the feedback. More details can be found in [14]. We note that part of the new information that is propagated is a randomly chosen color that identifies (with high probability) the current PIF execution initiated by $p_i$ as a new PIF execution.
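The feedback rule described above can be sketched as a small predicate. The Python fragment below is an illustration under simplifying assumptions: `dist` is a single dictionary holding each processor's current (upper bound on the) distance to the PIF initiator, whereas in the algorithm every processor reads these values from its own table, and the tie-break `min(closer)` for choosing among several closer neighbors is hypothetical.

```python
def feedback_target(j, dist, neighbors):
    """Return the parent to which p_j may send feedback, or None if p_j must wait."""
    d_j = dist[j]
    nbrs = neighbors[j]
    # Wait while some neighbor's distance differs from p_j's distance by more than 1.
    if any(abs(d_j - dist[k]) > 1 for k in nbrs):
        return None
    # The parent is a neighbor that is exactly one step closer to the initiator.
    closer = [k for k in nbrs if dist[k] == d_j - 1]
    return min(closer) if closer else None
```

On a chain `i - a - b` with distances 0, 1, 2, processor `b` reports to `a` and `a` reports to `i`; if `b` still holds a stale distance such as 5, neither `a` nor `b` sends feedback until the tables are corrected.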
The fast convergence algorithm should ensure stabilization from an arbitrary state. We trace the activity of the system from the first fault detection. We would like the fault detection to ensure that every processor will start a PIF following the fault detection. Then, when every processor completes the PIF and verifies that its tree is a fixed BFS tree, we can stop executing the communication expensive tree update algorithm.
When $p_i$ detects a fault, it starts a PIF that causes every processor $p_j$ either 1) to change its state from safe to dtc and start a PIF, or 2) when $p_j$ is in the state act, to execute at least one more complete PIF before changing its state to safe.
The update algorithm is executed by $p_i$ whenever there exists a tuple in $TU_i$ with a state field not equal to safe. Otherwise, $p_i$ responds to any $TU_j$ message (sent by a neighbor $p_j$) by recomputing $TU_i$ accordingly and sending $TU_i$ to $p_j$ exclusively. (Note that the transient fault detector is disabled whenever there exists a tuple in $TU_i$ with a state field not equal to safe.)
We may conclude that, once a transient fault is detected and propagated to the entire system, it holds that 1) no processor is in a safe state, and 2) no transient fault is detected.
Upon Completing the Propagation of a Fault Detection
A processor $p_i$ that has completed propagating the fault indication (completing a single PIF) changes its state to act. Then, $p_i$ waits for all other processors to complete their propagation of fault detection, reaching a system state in which no processor uses the failure detector to detect a fault and starts propagating an indication of a failure. In other words, when $p_i$ is in act state, $p_i$ repeatedly executes PIF until it receives an indication that no dtc tuple appears in any table.
The indication for the absence of dtc tuples is collected using a PIF query. The PIF procedure is used to query the values of the state fields using the following procedure: Every tuple of the update tables is extended to include a nodtc bit field. When $p_k$ chooses a new color, $p_k$ sets the nodtc bit to true and starts a PIF. Whenever there exists a tuple with the state dtc in $TU_j$, $p_j$ sets the nodtc bit of every tuple in its table to false. Whenever $p_j$ sends feedback to its parent (as part of the PIF), $p_j$ also sends the logical AND of the nodtc bit values of its children's tables and its own table. Thus, a single nodtc = false results in a nodtc = false feedback that arrives at $p_k$.
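The way the nodtc bit is combined along the PIF tree can be sketched as a recursive AND. This Python fragment is only an illustration: `children` (the PIF tree rooted at the querying processor) and `table_states` (the list of state values found in each processor's table) are hypothetical stand-ins for the update tables.

```python
def nodtc_feedback(children, table_states, node):
    """AND of the nodtc bits over the subtree rooted at node."""
    # A processor clears its own bit when any tuple in its table has state dtc.
    own = "dtc" not in table_states[node]
    below = [nodtc_feedback(children, table_states, c)
             for c in children.get(node, [])]
    return all([own] + below)
```

A single table containing a dtc tuple anywhere in the tree therefore forces the value that reaches the querying processor to false.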
We may conclude that, once the nodtc PIF query procedure is completed with nodtc = true, no processor is in a dtc state (and the transient fault detectors are disabled). Furthermore, let $p_k$ be the first processor that changes its state from act to safe after processor $p_i$ has notified a fault detection. Let $c$ be the configuration that immediately follows this state change of $p_k$. We will prove that 1) the tree rooted at each processor in $c$ is a fixed BFS tree, 2) the state field of every tuple in every table in $c$ is act, and 3) no transient fault is detected.
Returning to Normal Operation
Once all the processors are in act state, the system is ready to return to normal operation. A processor $p_i$ changes its state to safe when $p_i$ is in act state and finds out that no dtc state exists in the system. Still, $p_i$ does not activate the transient failure detector until all processors change their state to safe. The processor $p_i$ repeatedly executes PIF queries until it finds that the state of all the processors is safe. Thus, when a processor returns to use the transient failure detector, all the processors are in a safe state; therefore, a fault detection will result in a global state change to dtc, then to act, and at last to safe after reaching a safe configuration.
The PIF query initiated by a processor in a safe state uses an additional allsafe bit field. When $p_k$ chooses a new color, $p_k$ sets both the nodtc and the allsafe bits to true and starts the PIF procedure. Recall that a processor $p_j$ sets the nodtc bit to false whenever there exists a tuple with a dtc state in $TU_j$. In addition, $p_j$ sets the allsafe bit to false whenever there exists a tuple with a state not equal to safe in $TU_j$. Also, $p_k$ changes its state to dtc whenever there is a dtc tuple in $TU_k$ or a feedback with nodtc = false arrives. If the allsafe bit carried by the feedback is true, then $p_k$ stops executing the update algorithm and starts using the transient fault detector. If the allsafe bit is false (and the nodtc bit is true), then $p_k$ assigns true to both the nodtc and allsafe bits and repeats the PIF query.
We note that the tree description used by the transient fault detector should be identical in all the processors before switching back to normal operation. Thus, the allsafe bit is also used to indicate that the tree description of a processor and its neighbors are identical (otherwise, the allsafe bit that arrives in the feedback is false).
We may conclude that, when the feedback of the allsafe PIF query is true, it holds that all the processors are in a safe state. Furthermore, let $p_k$ be the first processor that returns to use the transient fault detector after $p_i$ propagates a fault detection. Let $c$ be the configuration in which $p_k$ returns to use the transient fault detector. It holds that, in $c$, the system is in a safe configuration with relation to the update algorithm.
We now turn to a detailed presentation of the fast convergence algorithm. The code of this algorithm appears in Fig. 5. In the code, we use the PIF and the PIF query procedures. A formal description of the PIF procedure can be found in [18], [14]. The PIF procedure is extended to PIF queries (nodtc and allsafe queries), as described above.
Lines 1, 2, and 3 of the code describe the actions $p_i$ takes, according to its state. When $p_i$ is in a dtc state (line 1), $p_i$ executes a PIF (line 1a). Once the PIF is completed, $p_i$ changes its state to act (line 1b). When $p_i$ is in act state (line 2), $p_i$ repeatedly executes a PIF query to ensure that no dtc tuple exists in the system (line 2a). Then, $p_i$ changes its state to safe (line 2b). In a safe state (line 3), $p_i$ repeatedly executes a PIF query to ensure that all the states (of the processors and the state fields of the tuples) are safe states (line 3a). If there exists a dtc tuple, then $p_i$ changes its state to dtc (line 3b). If, indeed, there are only safe tuples in the system, then $p_i$ returns to use the transient failure detector (line 3c). Once the failure detector is operating, $p_i$ changes its state to dtc whenever a fault is detected (lines 3c and 3d).
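The state changes walked through above can be summarized as a small transition function. The following Python sketch is not the code of Fig. 5: it assumes that each call corresponds to the completion of one PIF (or PIF query) or to one step of the running detector, and the names `State`, `step`, and `detector_on` are hypothetical.

```python
from enum import Enum


class State(Enum):
    DTC = "dtc"
    ACT = "act"
    SAFE = "safe"


def step(state, detector_on, *, nodtc, allsafe, fault):
    """Return (new_state, detector_on) after one completed PIF or detector step."""
    if state is State.DTC:
        # Lines 1a-1b: the PIF completed, so the state becomes act.
        return State.ACT, False
    if state is State.ACT:
        # Lines 2a-2b: move to safe only when the query found no dtc tuple.
        return (State.SAFE, False) if nodtc else (State.ACT, False)
    if detector_on:
        # Lines 3c-3d: the detector is running; a detected fault leads to dtc.
        return (State.DTC, False) if fault else (State.SAFE, True)
    if not nodtc:
        # Line 3b: a dtc tuple exists somewhere in the system.
        return State.DTC, False
    if allsafe:
        # Line 3c: every state is safe, so the detector is switched back on.
        return State.SAFE, True
    # Line 3a is repeated: some processor is still in act state.
    return State.SAFE, False
```

Starting from `State.DTC`, repeated calls trace the path dtc, act, safe, and finally safe with the detector enabled, which is the order of state changes described in the text.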
Correctness Proof of the Fast Convergence Algorithm
We now prove the correctness of the algorithm presented in Fig. 5. Recall that each PIF starts by choosing a random color of the PIF (which enables $p_i$ to guarantee that, indeed, the feedback it receives is related to the current PIF). We say that configuration $c$ starts $p_i$'s PIF (query) if and only if $c$ is the configuration that follows an atomic step in which $p_i$ chooses a new color.
In the correctness proof of the algorithm presented in Fig. 5, we assume that any color choice results in a nonexisting color. If we let the color range be $n^{10}$, then the probability is very small that the choice of a color by a processor $p_i$ might result in an existing color of a tuple with $id = i$. Once a processor $p_i$ chooses a nonexisting color, all the tuples with $id = i$ have the same color, and $p_i$ chooses a new color different from the existing color. Moreover, a standard argument can be used to prove that the system stabilizes in $O(d)$ expected asynchronous cycles, even when the choice results in an existing color [17]. Roughly speaking, the system has an overwhelming probability to start choosing nonexisting colors following the arbitrary configuration reached due to the wrong color choice.
Next, we state three properties that are related to the PIF procedure (these properties are basic properties of the PIF procedure, as described and proven in [18]).
Property 1. Let $R$ be a fair execution and $p_i$ a processor that starts the PIF procedure in a configuration $c$. The feedback of the PIF procedure arrives within the first $O(d)$ asynchronous cycles that follow $c$.
Property 2. Let $R$ be a fair execution and $p_i$ a processor that starts the PIF procedure in a configuration $c$. Then, when the PIF feedback arrives in a configuration that follows $c$, the tree update tuples encode a fixed BFS tree rooted at $p_i$.
Property 3. Let $R$ be a fair execution, such that:
1. Processor $p_i$ starts a PIF in configuration $c_p \in R$, with a nonexisting color.
2. There exists a configuration $c_f \in R$, such that $c_f$ is the first configuration that follows $c_p$ and also follows the arrival of the PIF feedback.
3. The state of $p_i$ is fixed between $c_p$ and $c_f$.
Let $p_k$ be a processor that is in the same connected component as $p_i$. Then there exists a configuration $c_k \in R$ between $c_p$ and $c_f$, such that $\langle i, *, *, s \rangle \in TU_k$ implies that $s$ is equal to $p_i$'s state in any configuration between $c_k$ and $c_f$.
We define a safe configuration with relation to the algorithm presented in Fig. 5. As proof of correctness, we consider two fair executions, $R_1$ and $R_2$, of the algorithm presented in Fig. 5. In $R_1$, no processor executes line 1a of the code (starting a PIF to propagate its dtc state), while in $R_2$, a processor does execute this line. We show that, within $O(d)$ asynchronous cycles, in both executions, $R_1$ and $R_2$, a safe configuration is reached. The next lemma considers $R_1$.
Lemma 12. Following $O(d)$ asynchronous cycles of $R_1$, the system is in a safe configuration with relation to the algorithm presented in Fig. 5.
Proof. We will show that, if no processor executes line 1a, then all the processors must execute line 3c forever, ensuring that the system is in a safe configuration. First, we show that there must exist a suffix $R'_1$ of $R_1$ in which no processor is in a dtc state. By Property 1, a processor in a dtc state must finish executing line 1a and, therefore, changes its state to act, executing line 1b, within $O(d)$ asynchronous cycles. Since no processor changes state to dtc in $R_1$, it holds that $R'_1$ must exist. Next, we show that there exists a suffix $R''_1$ of $R'_1$ in which all the processors are in a safe state. Property 1 and the definition of $R'_1$ imply that the execution of line 2a must terminate. Thus, every processor in act state must change its state to safe within $O(d)$ asynchronous cycles. Note that, in case no processor changes state to dtc, the only state changes possible are dtc to act and, then, to safe. To complete the proof, note that, during $R''_1$, the execution of line 3a must terminate within $O(d)$ asynchronous cycles with an indication that all the processors are in a safe state (otherwise, the definition of $R''_1$ is violated), and line 3c is therefore reached. By the fact that the transient fault detectors of all the processors are active, and by Corollary 4, a safe configuration is reached within $O(d)$ asynchronous cycles. □
Next, we consider $R_2$, in which a processor $p_i$ executes line 1a of the code. We note that Lemma 12 implies that, if a processor executes line 1a during an execution, then it does so during the first $O(d)$ asynchronous cycles of the execution; namely, before a safe configuration is reached.
Lemma 13. Following $O(d)$ asynchronous cycles of $R_2$, the system reaches a configuration in which the value of every state field is safe.
Proof. Let $p_i$ be a processor that initiates a PIF executing line 1a of the code, such that $p_i$ is the first processor that receives a PIF feedback. Note that, by the definition of $R_2$ and Lemma 12, the feedback arrives at $p_i$ within the first $O(d)$ asynchronous cycles. Let $c_f$ be the configuration that immediately follows this feedback arrival. The value of the state field of the tuple with $id = i$ in every $TU$ table in $c_f$ is dtc. Therefore, there is no processor in a safe state in $c_f$. By Property 1, every processor that is in a dtc state in $c_f$ changes its state to act within the first $O(d)$ asynchronous cycles that follow $c_f$. Moreover, no processor changes state to safe until it first receives an indication that there is no dtc state in any tuple of the system (line 2a of the code). Let $p_k$ be the first processor that changes state from act to safe (line 2b) following $c_f$, and let $c_k$ be the configuration that immediately follows this change. No processor changes state to safe in the execution that starts in $c_f$ and ends in $c_k$ (and, therefore, no processor changes state to a dtc state). The processor $p_k$ executes a PIF query following $c_f$ and before $c_k$, finding that no dtc state value exists. Hence, there cannot exist a state field with a dtc value in $c_k$.
Let $p_l$ be the first processor that activates its transient fault detector following $c_k$, and let $c_l$ be the first configuration following $c_k$ in which $p_l$ executes line 3c. Note that, by Property 1, $c_l$ must appear within the first $O(d)$ cycles that follow $c_k$. The processor $p_l$ has executed a PIF query following $c_k$, finding that all the processors are in a safe state before executing line 3c. Hence, the value of every state field in $c_l$ is safe. □
The next lemma assumes that all the processors are in a safe state and that they follow a coordinated execution that must follow this configuration until the processors change states to a safe state again. This time, the configuration reached must be a safe configuration.
Lemma 14. Following $O(d)$ asynchronous cycles of $R_2$, the system is in a safe configuration with relation to the algorithm presented in Fig. 5.
Proof. By Lemma 13, the system reaches a configuration in which the state of every processor is safe within $O(d)$ asynchronous cycles. If no fault is detected, then we can conclude that the system is in a safe configuration, since every processor is executing line 3c of the code. Therefore, we may use Corollary 4. Otherwise, there exists a processor that assigns dtc to its state, causing every processor to assign dtc and to execute a PIF within the first $O(d)$ asynchronous cycles that follow the detection. When the last processor changes state to act, the processors are ready to change state to safe. Using the fact that line 3a is executed, and by use of Property 2, it must hold that all the trees are fixed BFS trees. Furthermore, the tree descriptions used by the failure detectors of the processors are all identical (this is part of the PIF allsafe query). Thus, a safe configuration is reached and no processor detects a fault thereafter. □
CONCLUDING REMARKS
This paper presents the first (randomized) asynchronous self-stabilizing group membership service. We believe that the new ideas presented in this paper will enrich the set of techniques used in the design of robust group communication services. For example, we do not utilize the idea of token passing for detecting a crash. Instead, we present a self-stabilizing scheme that detects a fault quickly (in a single asynchronous cycle) and is still communication efficient. Our membership service can serve as the base for additional group communication services such as group multicast service [21] .
