Abstract. It is an interesting question whether one can devise highly fault-tolerant distributed protocols that tolerate both processor failures and transient memory errors. To answer this question we consider self-stabilizing wait-free shared memory objects. In this paper we propose a general definition of a self-stabilizing wait-free shared memory object that expresses safety guarantees even in the face of processor failures. We prove that within this framework one cannot construct a self-stabilizing single-reader single-writer regular bit from single-reader single-writer safe bits. This impossibility result leads us to postulate a self-stabilizing dual-reader single-writer safe bit as the minimal object needed to achieve self-stabilizing wait-free interprocess communication and synchronization. Based on this model, adaptations of well-known wait-free constructions of regular and atomic shared registers are proven to be self-stabilizing.
Introduction
The importance of reliable distributed systems can hardly be exaggerated. In the past, research on fault-tolerant distributed systems has focused either on system models in which processors fail, or on system models in which the memory is faulty. In the first model a distributed system must remain operational while a certain fraction of the processors is malfunctioning. When constructing shared memory objects such as atomic registers, this issue is addressed by considering wait-free constructions, which guarantee that any operation executed by a single processor is able to complete even if all other processors crash in the meantime. Originally, research in this area focused on the construction of atomic registers from weaker (safe or regular) ones [VA86, Lam86, PB87, LTV89, IS92]. Later, attention shifted to stronger objects (cf. [AH90, Her91] and many others).
In the second model a distributed system is required to overcome arbitrary changes to its state within a bounded amount of time. If the system is able to do so, it is called self-stabilizing.
To develop truly reliable systems both failure models must be considered together. We briefly summarize recent theoretical research that addresses this issue. Anagnostou and Hadzilacos [AH93] show that no self-stabilizing, fault-tolerant protocol exists to determine, even approximately, the size of a ring. Gopal and Perry [GP93] present a 'compiler' to turn a fault-tolerant protocol for the synchronous rounds message-passing model into a protocol for the same model which is both fault-tolerant and self-stabilizing. A combination of self-stabilization and wait-freedom in the construction of clock-synchronization protocols is presented in [DW93, PT94]. Another approach to combining processor and memory failures is put forward by Afek et al. [AGMT92, AMT93] and Jayanti et al. [JCT92]. They analyze whether shared objects do or do not have wait-free (self)-implementations from other objects of which at most t are assumed to fail. Objects may fail by giving responses which are incorrect, or by responding with a special error value, or even by not responding at all. In so-called gracefully degrading constructions, operations during which more than t objects fail are required to fail in the same manner.
We are interested in exploring the relation between self-stabilization and wait-freedom in shared memory objects. A shared memory object is a data structure stored in shared memory which may be accessed concurrently by several processors through the invocation of operations defined for it. Self-stabilizing wait-free objects occur naturally in distributed systems in which both processors and memory may be faulty. We give a general definition of self-stabilizing wait-free shared memory objects, and focus on studying the self-stabilizing properties of wait-free shared registers. Single-writer single-reader safe bits, traditionally used as the elementary memory units from which these registers are built, are shown to be too weak for our purposes. Focusing on registers, being the weakest type of shared memory objects, allows us to determine the minimal object properties needed for a system to be able to converge to legal behaviors after transient memory faults, as well as to remain operative in the presence of processor crashes.
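As an informal illustration of the gap between safe and regular bits, the following Python sketch computes the set of values a read may return from a single-writer register under the classical (fault-free) definitions: a safe register may return any value of its domain when the read overlaps a write, whereas a regular register may return only the value of the immediately preceding write or of a concurrent write. The names `Write` and `allowed_read_values`, the integer time stamps, and the interval model are assumptions made for this sketch only; they are not the notation of this paper, and the sketch does not model transient memory faults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """A write operation occupying the time interval [start, end] (hypothetical model)."""
    start: int
    end: int
    value: int


def allowed_read_values(writes, initial, read_start, read_end, level, domain=(0, 1)):
    """Return the set of values a read over [read_start, read_end] may return.

    Assumes a single writer, so the writes do not overlap each other.
    `level` is "safe" or "regular" (classical definitions, no memory faults).
    """
    ordered = sorted(writes, key=lambda w: w.start)
    # Writes that finished strictly before the read began.
    preceding = [w for w in ordered if w.end < read_start]
    # Writes whose interval overlaps the read interval.
    concurrent = [w for w in ordered if w.start <= read_end and w.end >= read_start]
    last = preceding[-1].value if preceding else initial

    if not concurrent:
        # Without concurrency, both levels return the most recently written value.
        return {last}
    if level == "safe":
        # A safe register may return an arbitrary value of the domain.
        return set(domain)
    if level == "regular":
        # A regular register returns the preceding value or a concurrently written one.
        return {last} | {w.value for w in concurrent}
    raise ValueError("level must be 'safe' or 'regular'")


# The bit holds 1, and the writer rewrites 1 while a read is in progress.
history = [Write(0, 1, 1), Write(5, 10, 1)]
safe_values = allowed_read_values(history, 0, 6, 7, "safe")
regular_values = allowed_read_values(history, 0, 6, 7, "regular")
```

In this history the writer overwrites the value 1 with 1 while the read is in progress. Under the safe model the read may return 0 or 1, while under the regular model it must return 1, so the two levels differ even for a single bit.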
Shared registers are shared objects reminiscent of ordinary variables, which can be read or written by different processors concurrently. They are distinguished by the level of consistency guaranteed in the presence of concurrent operations [Lam86]. A register is safe if a read returns the most recently written value, unless the read is concurrent with a write, in which case it may return an arbitrary value. A register is regular if a read returns the value written by a concurrent or an immediately preceding write. A register is atomic if all operations on the
