The study of fault-tolerant and asynchronous parallel algorithms for the parallel random access machine (PRAM [8]) has attracted a fair amount of recent attention. Several efficient algorithms have been designed for PRAMs that are subject to stop-failures or to processor delays, where this processor behavior is determined by adversaries of varying strengths. For example, asynchronous PRAMs are the subject of [1, 4, 5, 6, 9, 13, 18, 19], and fault-prone PRAMs are studied in [4, 11, 12, 13, 20]. The motivation of this research area is to bridge the gap between realizable parallel computers and the PRAM, with its unrealistic features of broad bandwidth memory access, processor synchrony and freedom from faults. Our work is in the area of asynchronous and fault-prone models, but we do use broad bandwidth access to shared memory as a means of providing redundancy when encountering faults. For a detailed discussion of the general model used and how it can be realized see [4].
Here, we reexamine the key problem of Write-All and remove a strong initialization assumption that has been used in all its previous solutions. Write-All was formulated in [11] in order to show that it is possible to combine efficiency and fault tolerance in the presence of arbitrary dynamic fail-stop PRAM processor errors. Its solutions have been used to compile PRAM algorithms for architectures where asynchrony or processor failures are present. It can be formulated as follows:
Using P processors, write 1's into all locations of an array of size N, where P ≤ N.
Write-All captures the computational progress that can be naturally accomplished in unit time by a PRAM (when P = N). In the presence of asynchrony or failures, efficient solutions to Write-All (increasing the fault-free work by polylogarithmic factors only) are non-obvious. Note that in all existing solutions the initial state of the size-N array does not matter. For example, we assume it is all 0's in [11, 4, 20], but the algorithms would work even if the N locations were initialized with arbitrary 0's and 1's. A much more important assumption in all previous Write-All solutions concerned the initial state of the additional auxiliary memory used (typically of $\Omega(P)$ size). The basic assumption has been that:
The $\Omega(P)$ auxiliary shared memory is cleared or initialized to some known value.
In theory, this is a natural, even if unstated, assumption for PRAMs [8] and RAMs (cf., Turing Machine auxiliary tapes are initially blank). However, given the definition of Write-All, this dependence on clear space raises a legitimate "chicken-or-egg" objection. In practice, memory locations typically contain unpredictable values, and processes that need to use large blocks of memory cannot assume that it is cleared or is initialized to a known value. In fact, operating systems usually provide explicit services that allocate clear memory, e.g., calloc() in standard C libraries. Such allocation is predictably much more time consuming, even in the absence of failures.
It is easy to construct simple Write-All algorithms that do not assume clear shared memory, but they appear to use quadratic work. If the overall computation involves many steps, one can perhaps afford an expensive initialization phase and amortize its cost over subsequent efficient steps. Unfortunately, when Write-All building blocks are used in very fast (i.e., polylogarithmic parallel time) algorithms (e.g., to compute prefix sums or list ranking), auxiliary memory initialization cannot be amortized over the computation. Fortunately, we show that there is a way around this dilemma:
We present Write-All algorithms and algorithm simulations that do not require that the auxiliary memory is cleared prior to the computation.
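The simple quadratic-work approach mentioned above can be illustrated with a minimal sketch. It needs no auxiliary memory at all: every processor independently sweeps the whole array from its own starting point, so one surviving processor is enough, but the work can reach N · P. The function name `naive_write_all` and the `crash_after` encoding of the adversary are hypothetical, introduced only for this illustration; the paper does not spell out this particular algorithm.

```python
def naive_write_all(initial, p, crash_after):
    """Sequentially simulate P processors that each sweep the whole array.

    initial        -- arbitrary (possibly contaminated) array contents
    crash_after[i] -- hypothetical adversary choice: number of writes that
                      processor i completes before it stops (None = never)
    Returns the final array and the number of completed writes (the work).
    """
    x = list(initial)            # initial contents are irrelevant
    n = len(x)
    work = 0
    for pid in range(p):
        start = (pid * n) // p   # spread the starting points over the array
        limit = n if crash_after[pid] is None else min(n, crash_after[pid])
        for step in range(limit):
            x[(start + step) % n] = 1
            work += 1
    return x, work


# Processor 1 never fails, so the array is fully written.
x, work = naive_write_all([7, 0, 1, 9, 0, 3, 1, 2], 4, [3, None, 0, 5])
```

With no failures each of the P processors performs N writes, which is the quadratic behavior the bootstrap approach of this paper is designed to avoid.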
Algorithms in the setting studied in the present paper have some similarities with the notion of a self-stabilizing system introduced by Dijkstra in [7] . Paraphrasing [7] , a system is self-stabilizing if and only if, regardless of the initial state the system can always make a state transition into another state, and the system is guaranteed to find itself in a legitimate state after a finite number of transitions. Our computations using initially contaminated memory can be viewed as self-stabilizing with respect to the state of shared memory. In order to describe our technical contributions we must now review the state-of-the-art of the algorithmics of Write-All.
For the worst case on-line stop-failures without restarts, Kanellakis and Shvartsman [11] gave an efficient (within a $\log^2 N$ factor) algorithm for Write-All (algorithm W) and other key problems using an iterated Write-All paradigm. This paradigm was then employed independently by Kedem et al. [12] and Shvartsman [20] to extend the results of [11] to arbitrary PRAM algorithms. In addition, Kedem et al. [12] analyzed the expected behavior of several solutions to Write-All using a random failure model. Shvartsman [20] presented a deterministic optimal O(N) work execution of PRAM algorithms subject to worst case failures by exploiting parallel slackness with $P \le N/\log^2 N$. A simple randomized Write-All algorithm that can be used for simulating arbitrary PRAM algorithms on an asynchronous PRAM is presented by Martel et al. in [18]; this simulation has very good expected performance when the adversary is off-line. Kedem et al. [13] have shown an $\Omega(N \log N)$ lower bound on work for any deterministic Write-All solution. In addition, they have shown an $O(N \log^2 N/\log\log N)$ deterministic work upper bound on Write-All. Their upper bound is based on a variation of algorithm W, and it has been shown by Martel [16] that the same upper bound applies to algorithm W [11].
For the worst case on-line stop-failures with restarts there has also been some progress. A parallel model where processors are subject to failures and restarts is examined by Buss et al. in [4]. This framework generalized previous models of robust parallel computations, and in it Write-All has a subquadratic $O(N^{1.59})$ work solution. Martel et al. [17] presented several randomized solutions for list ranking and sorting that have very efficient expected work when the scheduling adversary is off-line. An efficient randomized solution for the Write-All problem was developed by Anderson and Woll in [1] for the asynchronous parallel model. They also gave an existence proof for an algorithm achieving work $O(N^{1+\varepsilon})$ for any $\varepsilon > 0$. General synchronous PRAM simulations are impossible using bounded resources on asynchronous PRAMs because of the impossibility result shown by Herlihy [10]. However, the algorithms in [1] can be used with the restartable fail-stop model defined by Buss et al. [4] (which restricts asynchrony). We will take advantage of this since general simulations are possible in that model.
Contributions:
We eliminate the assumption that any amount of clear initial memory is available for the fail-stop and fail-stop restartable algorithms. We develop deterministic fault-tolerant algorithms that can be used to simulate PRAMs using contaminated memory, i.e., when the shared memory not containing the input is initially in an arbitrary and possibly illegal state. We also improve on the state-of-the-art robust prefix sums computations. More specifically:
1. In the no-restart fail-stop parallel model, any N-processor PRAM algorithm that runs in time $\tau$ can be deterministically simulated using O(N) contaminated memory on P fail-stop processors with work $O(N + P \log^3 N/(\log\log N)^2 + \tau \cdot P \log^2 N/\log\log N)$ for $1 \le P \le N$.
This simulation has an optimal range of processors, i.e., the work of the simulation is asymptotically equal to the work of the simulated non-fault-tolerant algorithm.
2. In the restartable fail-stop model, any N-processor PRAM algorithm that runs in time $\tau$ can be simulated using O(N) contaminated memory on P = N restartable fail-stop processors with $S = O(\tau \cdot N^{1+\varepsilon})$.
3. For the parallel prefix computation it is possible to improve on the oblivious simulations of non-fault-tolerant algorithms (e.g., the ones we get by using [12, 20] with conventional algorithms). In order to compute the prefix sums of N values using N processors, at least log N/log log N parallel steps are required [2, 15], and the known algorithms require at least log N steps. Therefore an oblivious simulation of a known prefix algorithm will require simulating at least log N steps. We improve the work of this oblivious deterministic simulation by a factor of log N when the memory is clear, and by a factor of log log N when the memory is contaminated.
In the rest of the paper, we present the model in Section 2 and the contamination-tolerant algorithms in Section 3, and we cover general simulations and algorithm transformations in Section 4.
Model and definitions
The basis of our model is the restartable fail-stop CRCW PRAM that is discussed and justified by Buss et al. in [4], except that the shared memory that does not contain the input is contaminated:
1. There are P PRAM processors. Each has a unique processor identifier PID ∈ {0, ..., P − 1}.
2. Shared memory is accessible to all processors; each processor has a constant size private memory. Each memory cell stores one word of size O(log max{N, P}).
3. The input is stored in N cells in shared memory.
4. The shared memory not containing the input is contaminated.
To enable algorithm termination and sensible accounting of resources, the work of the processors is structured using update cycles. Each cycle consists of reading a small number of shared memory cells, performing a fixed time computation, and writing a small number of shared memory cells. The number of reads and writes per cycle is fixed, but depends on the instruction set of the PRAM. The fail-stop with restart failure model is defined as follows:
1. A failure pattern F (i.e., failures and restarts) is determined by an on-line adversary, that knows everything about the algorithm and is unknown to the algorithm.
2. Any processor may fail at any time in any update cycle, and it may later restart, provided: (i) at any time at least one processor is executing an update cycle that successfully completes;
(ii) single bit writes are atomic, i.e., failures can occur before or after a write of a single bit.
3. Failures do not affect the shared memory, but the failed processors lose their private memory. Processors are restarted at their initial state with their PID as their only knowledge.
Condition 2(i) makes termination possible. Update cycles also serve as units of accounting. They do not constrain the instruction set of the PRAM, however the processors are not charged for the instructions of the update cycles that are not completed. (In the absence of update cycle accounting, a thrashing adversary can force quadratic work for any Write-All solution [4] .)
A failure pattern F is specified as a set of triples <tag, PID, t>, where tag is either failure for a processor failure or restart for a restart, PID is the processor identifier, and t is the time when the processor either stops or restarts. The size of F is defined as the cardinality |F|.
The complexity measure completed work generalizes the Parallel-time × Processors product:

Definition 2.1 Consider an algorithm with P initial processors that terminates in parallel-time $\tau$ after completing its task on some input data I of size $|I| = N$, and in the presence of any pattern F of failures and restarts of size $|F| \le M$. If $P_i(I, F) \le P$ is the number of processors completing an update cycle at time i, and c is the time required to complete one update cycle, then we define completed work as: $S = S_{N,M,P} = \max_{I,F} \left\{ c \sum_{i=1}^{\tau} P_i(I, F) \right\}$.
Remark 1 The incomplete work cycles are not counted in S. When the restarts do not occur, then the maximum work spent in the incomplete cycles is bounded by O(P), since there can be no more than P failures. Therefore, for the fail-stop no-restart model, using completed work S yields the same results as using the available processor steps measure in [11].
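As an illustration of Definition 2.1, the following sketch evaluates the inner sum for one fixed failure pattern. It relies on simplifying assumptions that are not part of the model: time is discrete (steps 1..`steps`), a processor that fails at time t completes no update cycle at t, and a processor that restarts at t completes cycles from t onward. The names `cycles_completed` and `completed_work` are hypothetical. The definition itself takes the maximum over all inputs and failure patterns, which this sketch does not attempt.

```python
def cycles_completed(p, steps, pattern):
    """Return [P_1, ..., P_steps]: processors completing a cycle at each step.

    pattern is a list of (tag, pid, t) triples with tag in
    {"failure", "restart"}, mirroring the triples <tag, PID, t>.
    """
    alive = [True] * p
    events = sorted(pattern, key=lambda e: e[2])   # process events in time order
    counts = []
    k = 0
    for t in range(1, steps + 1):
        while k < len(events) and events[k][2] <= t:
            tag, pid, _ = events[k]
            alive[pid] = (tag == "restart")
            k += 1
        counts.append(sum(alive))
    return counts


def completed_work(counts, c=1):
    """c times the number of completed update cycles, for one pattern."""
    return c * sum(counts)


pattern = [("failure", 0, 2), ("failure", 1, 3), ("restart", 0, 4)]
counts = cycles_completed(3, 4, pattern)
S = completed_work(counts)
```

Update cycles that are interrupted by a failure simply never appear in `counts`, which matches the remark that incomplete cycles are not charged.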
We use the notation "Write-All(N, P, L)" to stand for an instance of fault-tolerant Write-All that uses P processors and clear auxiliary memory of size L to initialize to 1 an array of size N.

Existing Write-All solutions, e.g., [11, 12, 13, 20], or the algorithms that can serve as Write-All solutions, e.g., the addition algorithm in [5] or the maximum finding algorithm in [18], invariably assume that a linear portion of shared memory is either cleared or is initialized to known values. Starting with a non-contaminated portion of memory, such algorithms and simulations are able to perform their computation by "using up" the clear memory, and concurrently or subsequently clearing additional segments of memory needed for future iterations. We develop an efficient Write-All solution that requires no clear shared memory.
A Bootstrap procedure
We formulate a bootstrap approach to the design of fault-tolerant Write-All algorithms for the case where the auxiliary memory is initially contaminated. The bootstrapping proceeds in stages:
In stage 1 of our procedure, all P processors clear an initial segment of $N_0$ locations in the auxiliary memory.
At stage i of the procedure, we use P processors to clear $N_{i+1}$ memory locations with the help of the $N_i$ memory locations that were cleared in stage i − 1. One specific approach is to define a series of multipliers $G_0, G_1, \ldots$ such that $N_i = \prod_{j=0}^{i} G_j$. A high-level view of such an algorithm is given in Figure 1. The algorithm consists of an initialization (lines 02-04) and a parallel loop (lines 04-09). We use a variation of this scheme below.
We next use the bootstrap approach to construct and analyze contamination-tolerant Write-All algorithms in the fail-stop and restartable fail-stop models. We analyze algorithm Z for the following choice of parameters: we use $G_0 = \log N$ and $G_i = G_{i-1} \log N$ (for i > 0). In the initialization, all P processors traverse a list of size $G_0$ sequentially and clear it. Then, iteratively, the processors use algorithm W to clear increasingly larger sections of memory using the auxiliary memory cleared in the previous iteration (Fig. 1, lines 05-07).
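To make the growth of the cleared region concrete, the sketch below lists the stage sizes used in the analysis of algorithm Z: stage 0 clears log N cells sequentially, and stage i ≥ 1 runs algorithm W on a progress tree with H = log^i N leaves and G = log N cells per leaf, ending when H reaches N/log N. The function name `bootstrap_schedule`, the use of base-2 logarithms, and the rounding are assumptions made for illustration; the sketch also ignores the constant factors needed to store the trees themselves.

```python
import math


def bootstrap_schedule(n):
    """Return a list of (stage, H, cells_cleared) tuples for array size n.

    H is the number of progress-tree leaves used in that stage
    (None for the sequential stage 0).
    """
    g = max(2, math.ceil(math.log2(n)))   # G = log N cells per leaf
    target = math.ceil(n / g)             # final stage uses H = N / log N
    stages = [(0, None, g)]               # stage 0: sequential clear of log N cells
    h = g
    i = 1
    while True:
        h_i = min(h, target)
        stages.append((i, h_i, h_i * g))  # stage i clears H * G cells
        if h_i >= target:
            break
        h *= g
        i += 1
    return stages


schedule = bootstrap_schedule(2 ** 16)
```

For N = 2^16 the schedule has a sequential stage followed by three uses of algorithm W, in line with a final stage index of log N/log log N − 1 = 3.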
Algorithm W is a fail-stop (no restart) Write-All solution. It uses two full binary trees (represented as heaps in memory) and it consists of a loop in which the active processors synchronously iterate through the following phases:
W1: enumerate the processors in a bottom-up traversal of the processor tree;
W2: allocate the processors in a divide-and-conquer top-down traversal of the progress tree;
W3: work at the leaves; and
W4: evaluate progress in a bottom-up traversal of the progress tree.
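The bottom-up evaluation of phase W4 over a heap-stored tree can be sketched as follows. This is a sequential, failure-free stand-in for the synchronous parallel traversal, written only to show the implicit heap indexing (node v has children 2v and 2v + 1, and the H leaves occupy positions H..2H − 1). The function name `evaluate_progress` is hypothetical, and H is assumed to be a power of two.

```python
def evaluate_progress(leaf_done):
    """Sum completed leaves bottom-up in a heap-indexed progress tree.

    leaf_done -- list of 0/1 flags, one per leaf; its length H must be
                 a power of two.
    Returns the heap array; entry 1 (the root) holds the total progress.
    """
    h = len(leaf_done)
    tree = [0] * (2 * h)          # index 0 is unused
    tree[h:] = leaf_done          # leaves live at positions h .. 2h-1
    for v in range(h - 1, 0, -1): # visit internal nodes from the bottom up
        tree[v] = tree[2 * v] + tree[2 * v + 1]
    return tree


tree = evaluate_progress([1, 0, 1, 1])
```

Because the parent and child positions are computed from indices alone, the shape of the tree does not depend on what the memory cells contain.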
To avoid a complete restatement, the reader is urged to refer to [11]. Martel showed the following upper bound for algorithm W:

Theorem 3.1 [16] Algorithm W with P processors, the progress tree with H leaves (P ≤ H) and 2H − 1 total nodes all initialized to zero, and G array elements at each leaf, has work $S = O((H + P \log H/\log\log H) \cdot (\log P + \log H + G))$ for any pattern of stop-failures.
Note that the above result and algorithm W can be used when P > H. As described in [4] , when there are P processors and the progress tree has H < P leaves, then it is sufficient for each processor to take its PID modulo H to assure uniform initial assignment of processors and to preserve the result.
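For a feel of how the bound of Theorem 3.1 behaves, the expression inside the O(·) can be evaluated numerically, together with the PID modulo H rule for the initial assignment when P > H. This is only a sketch: the names `work_bound_w` and `initial_leaf` are hypothetical, base-2 logarithms are assumed, constant factors are ignored, and H must exceed 2 so that log log H is positive.

```python
import math


def work_bound_w(h, p, g):
    """Expression inside the O(.) of Theorem 3.1, with logs taken base 2."""
    lg = math.log2
    return (h + p * lg(h) / lg(lg(h))) * (lg(p) + lg(h) + g)


def initial_leaf(pid, h):
    """Initial leaf for processor pid when there are more processors than leaves."""
    return pid % h


bound = work_bound_w(16, 16, 1)
leaves = [initial_leaf(pid, 4) for pid in range(8)]
```

The modulo rule places the same number of processors at each leaf whenever H divides P, which is the uniform initial assignment referred to above.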
Algorithm W stores its binary trees as linear arrays interpreted as heaps. Therefore the structure of the trees is unaffected by the state of the memory, because the heaps are implicit. We next observe that the enumeration of the processors in phase W1 of algorithm W can be done in a bottom-up traversal of a contaminated processor tree. The pseudocode for this algorithm is given in Figure 2. We call it algorithm Znum. The surviving processors enumerate themselves using a standard logarithmic time algorithm based on addition. The contaminated memory cells are distinguished from the cells that contain valid values via the use of a single bit associated with each cell (a so-called "deadman flag"). When a processor arrives at a node,

Proof: We first evaluate and then total the work of the algorithm during each of the finite number of stages of its execution. In each use of algorithm W, we will have G = log N as the number of memory locations associated with each leaf of the progress tree, and we will apply Theorem 3.1 with different instantiations of H to evaluate the upper bound on work.
Stage 0: Enumerate processors using Znum, then sequentially clear log N memory locations using all surviving processors. The work using the initial $P_0 \le P$ processors is: $W_0 = P_0 \cdot \log P + P \cdot \log N$.
Stage 1: $P_1 \le P_0 \le P$. Using the instance of Theorem 3.1 where H = log N, the work is:
$W_1 = (\log N + P_1 \log\log N/\log\log\log N) \cdot (\log P_1 + \log N + \log\log N)$.
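Stage i: $P_i \le P_{i-1} \le N$. Using the instance of Theorem 3.1 where $H = \log^i N$, the work is: $W_i = (\log^i N + P_i \log(\log^i N)/\log\log(\log^i N)) \cdot (\log P_i + \log N + \log(\log^i N))$.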
The final stage $\tau$ is when $\log^{\tau} N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$.
Totalling the work in all stages yields $S = \sum_{i=0}^{\tau} W_i$. Simplifying gives $S = O(N + P \log^3 N/(\log\log N)^2)$.
Algorithm Zr for the restartable fail-stop model
Algorithm Zr is similar to algorithm Z, except that in each stage we will be utilizing a restartable Write-All algorithm. (Algorithm W is not suitable when restarts are allowed; see [4].) Other parameters of the bootstrap procedure are the same as for the fail-stop case.
In this analysis, we will be using an algorithm that was described and characterized by Anderson and Woll [1] (Theorem 3.4). This is an existential result, and we call this algorithm AW. The best known constructed deterministic algorithm has $\varepsilon = \log_2 3 - 1 < 0.59$, as was shown by Buss et al. [4] (algorithm X, which can also be used with the bootstrap). Note that algorithm AW was developed for the asynchronous model, but it can be used in the restartable fail-stop model as well. The work of the algorithm in the asynchronous model is the same as its completed work in the restartable fail-stop model.

We evaluate and then sum the work of the algorithm during each of the finite number of stages of its execution. In each stage i ≥ 1 of algorithm Zr, we will use algorithm AW log N times to clear $\log^{i+1} N$ memory locations. In each instance of use of Theorem 3.4, we will use $\delta > 0$ as the exponent, such that $\varepsilon/2 = \delta$. This is done to simplify the final sum using the property that $\log N = O(N^{\delta})$ for any $\delta > 0$. We also use P = N for clarity.
Stage 0: All processors linearly initialize the segment of shared memory of length log N. The work is: $W_0 = P \cdot \log N$.
Stage 1: The algorithm is applied log N times to clear a segment of shared memory of size $\log^2 N$. Using the instance where H = log N, the work is: $W_1 = (P \log^{\delta} N) \cdot \log N$.
Final stage $\tau$, where $\log^{\tau} N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$. Using the instance where $H = \log^{\tau} N = N/\log N$, the work is: $W_{\tau} = (P (\log^{\tau} N)^{\delta}) \cdot \log N$.
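Totalling the work in all stages: $S = \sum_{i=0}^{\tau} W_i = P \log N + \sum_{i=1}^{\tau} (P (\log^{i} N)^{\delta}) \cdot \log N$.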
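One way to see how the stage costs of algorithm Zr combine is sketched below, assuming P = N and $\delta = \varepsilon/2$ as stated above and taking each stage i ≥ 1 to cost $W_i = (P(\log^i N)^{\delta}) \cdot \log N$. This is a coarse bounding argument offered for illustration, not necessarily the derivation used by the authors.

```latex
\begin{align*}
S &= \sum_{i=0}^{\tau} W_i
   = P\log N + \sum_{i=1}^{\tau} P\,(\log^{i} N)^{\delta}\,\log N \\
  &\le P\log N + \tau \cdot P\,N^{\delta}\,\log N
   && \text{since } \log^{i} N \le N \text{ for } i \le \tau \\
  &= O\!\left(N^{1+\delta}\log^{2} N\right)
   && \text{since } P = N \text{ and } \tau < \log N \\
  &= O\!\left(N^{1+2\delta}\right) = O\!\left(N^{1+\varepsilon}\right)
   && \text{since } \log^{2} N = O(N^{\delta}).
\end{align*}
```

The last step uses the fact that any fixed power of log N is $O(N^{\delta})$ for every $\delta > 0$, which is why the exponent is split as $\delta = \varepsilon/2$.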
4 Algorithm simulations and algorithm transformations
Oblivious simulations
Using general simulation techniques [12, 20], if $S_W(N, P)$ is the efficiency of solving a Write-All instance of size N using P processors, and if a linear amount of clear memory is available, then a single N-processor PRAM step can be deterministically simulated using P fail-stop processors and work $S_W(N, P)$. Thus if the Parallel-time × Processors product of an original N-processor algorithm is $\tau \cdot N$, then the work S of the fault-tolerant version of the algorithm will be $O(\tau \cdot S_W(N, P))$.
For the setting with initially contaminated shared memory, using algorithms Z and Zr with the simulation techniques [12, 20], we obtain the following results:

Theorem 4.1 Any N-processor, $\tau$ parallel time PRAM algorithm can be simulated using O(N) contaminated memory and P fail-stop CRCW processors with $S = O(P \log^3 N/(\log\log N)^2 + \tau \cdot N + \tau \cdot P \log^2 N/\log\log N)$ for $1 \le P \le N$.
This simulation has optimal ranges:

Corollary 4.2 Any N-processor, $\tau$ parallel time PRAM algorithm can be simulated using O(N) contaminated memory and P fail-stop CRCW processors with $S = O(\tau \cdot N)$ when:
(1) $1 \le P \le N (\log\log N)^2/\log^3 N$, or
(2) $1 \le P \le N \log\log N/\log^2 N$ and $\tau \ge \log N/\log\log N$.
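The two processor ranges of the corollary can be evaluated for a concrete N to get a sense of scale. The sketch below assumes base-2 logarithms and ignores the constants hidden by the asymptotic notation, so the numbers are indicative only; the function name `optimal_processor_limits` is hypothetical.

```python
import math


def optimal_processor_limits(n):
    """Return (limit_1, limit_2, min_tau) for the two optimal ranges.

    limit_1 -- upper end of range (1): N (log log N)^2 / log^3 N
    limit_2 -- upper end of range (2): N log log N / log^2 N
    min_tau -- parallel time required by range (2): log N / log log N
    """
    lg = math.log2(n)
    lglg = math.log2(lg)
    limit_1 = n * lglg ** 2 / lg ** 3
    limit_2 = n * lglg / lg ** 2
    min_tau = lg / lglg
    return limit_1, limit_2, min_tau


limits = optimal_processor_limits(2 ** 16)
```

Range (2) admits more processors than range (1), at the price of requiring the simulated algorithm to run for at least log N/log log N steps so that the one-time bootstrap cost is absorbed.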
In the restartable fail-stop model we get:

Theorem 4.3 Any N-processor, $\tau$ parallel time PRAM algorithm can be simulated using O(N) contaminated memory and N restartable fail-stop CRCW processors with $S = O((1 + \tau) \cdot N^{1+\varepsilon})$.
Remark 2 Buss et al. [4] define an amortized complexity measure of overhead ratio $\sigma$ that measures the computational overhead of an algorithm relative to the necessary work and the number of failures that are encountered. The simulation in the restartable fail-stop model has overhead ratio per PRAM step of $\sigma = N^{\varepsilon}$. This overhead ratio can be made polylogarithmic by interleaving algorithm Zr with algorithm V as presented in [4].
Improving oblivious simulations
In addition to serving as the basis for oblivious simulations, any solution for the Write-All problem can also be readily used as a building block for custom transformations of efficient parallel algorithms into robust ones [11]. Custom transformations are interesting because in some cases it is possible to improve on the work of the naive oblivious simulation. These improvements are most significant for fast algorithms when a full range of processors is used, i.e., when N processors are used to simulate N processors, because in this case the parallel slack cannot be taken advantage of. For example, in the models with clear initial memory, a factor of log N/log log N was saved off the pointer doubling simulations [11], and using randomization and off-line adversaries, improvements can be obtained in expected work of other algorithms [17, 18].
We next show how to obtain deterministic savings in work for the prefix sums algorithm that occurs in solutions of several important problems [3]. Efficient parallel algorithms and circuits for computing prefix sums were given by Ladner and Fischer in [14], where the prefix problem is defined as follows: Given an associative operation $\oplus$ on a domain D, and $x_1, \ldots, x_n \in D$, compute, for each k ($1 \le k \le n$), the sum $\bigoplus_{i=1}^{k} x_i$.
Prefix sums can be computed robustly by using a naive simulation of a standard logarithmic time algorithm. When using P = N processors, the work of such a simulation will be $O(S_W \cdot \log N)$.
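For reference, a standard logarithmic-time prefix computation can be sketched as a sequence of synchronous rounds. The version below is the simple doubling scheme (often attributed to Hillis and Steele), not the Ladner and Fischer circuits of [14]; each round stands for one PRAM step in which all positions read before any position writes. The function name `parallel_prefix` is hypothetical.

```python
import operator


def parallel_prefix(values, op=operator.add):
    """Compute all prefix 'sums' under an associative op in ceil(log2 n) rounds.

    Returns (prefixes, rounds). Each round models one synchronous PRAM
    step: the new array is built entirely from the old one.
    """
    x = list(values)
    n = len(x)
    d = 1
    rounds = 0
    while d < n:
        # Position i >= d combines the value d places to its left with its own,
        # keeping left-to-right order so non-commutative ops also work.
        x = [x[i] if i < d else op(x[i - d], x[i]) for i in range(n)]
        d *= 2
        rounds += 1
    return x, rounds


prefixes, rounds = parallel_prefix([3, 1, 4, 1, 5, 9, 2, 6])
```

An oblivious simulation would pay the Write-All cost once for each of these rounds, which is where the log N factor in the naive bound comes from.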
Prior to dealing with prefix sums, we make a simple observation that improves on another general simulation. It follows from the fact that algorithms W and AW, since by definition they implement tree traversals, can be used to implement an associative operation on N values (Theorem 4.4). This saves a full log N factor over oblivious simulations. We extend Theorem 4.4 and show a robust prefix sums algorithm whose work complexity is $O(S_W)$. In the no-restart fail-stop model we have the following result:
