We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processors are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences, the two models share key algorithmic techniques.
1 Introduction
The model of parallel computation known as the Parallel Random Access Machine or PRAM [14] has attracted much attention in recent years. Many efficient and optimal algorithms have been designed for it; see the surveys [13, 23]. The PRAM is a convenient abstraction that combines the power of parallelism with the simplicity of a RAM, but it has several unrealistic features. The PRAM requires: (1) simultaneous access (requiring significant bandwidth) to a shared resource, namely memory; (2) global processor synchronization; and (3) perfectly reliable processors, memory and interconnection between them. The gap between the abstract models of parallel computation and realizable parallel computers is being bridged by current research. For example, memory access simulation in other architectures is the subject of a large body of literature surveyed in [45]; for some recent work see [17, 37, 44]. Algorithms with initial memory faults are examined in [43]. Asynchronous PRAMs are the subject of [8, 9, 15, 34, 35]. Here we address the issues of synchronization and reliability of PRAM processors.
In [21] we show that it is possible to combine efficiency and fault-tolerance in many key PRAM algorithms in the presence of arbitrary dynamic fail-stop processor errors (when processors fail by stopping and do not perform any further actions). The key to such algorithm design is the following fundamental problem, called the Write-All problem [21]:
Given a P-processor PRAM and a 0-valued array of N elements, write value 1 into all array locations.
This problem was formulated to capture the essence of the computational progress that can be naturally accomplished in unit time by a PRAM (when $P = N$). In the absence of failures, this problem is solved by a trivial and optimal parallel assignment. However, it is not obvious how to design solutions that are efficient in the presence of failures or asynchrony. An algorithm for the Write-All problem that does a total of $O(N \log^2 N)$ work is given in [21] (algorithm W).
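To make the problem statement concrete, the following minimal sketch simulates the trivial assignment for $P = N$ and shows why it is not fault tolerant on its own. The function name `write_all_static` and the `alive` parameter are hypothetical names introduced for illustration; the parallel step is serialized in a loop.

```python
def write_all_static(n, alive):
    """One synchronous step of the trivial assignment with P = N.

    Processor pid writes x[pid] = 1 if it is alive during the step.
    `alive` is a hypothetical set of pids that do not fail.
    """
    x = [0] * n
    for pid in range(n):          # conceptually parallel; serialized here
        if pid in alive:
            x[pid] = 1
    return x

# No failures: the array is filled in one step using N units of work.
full = write_all_static(8, alive=set(range(8)))

# One stopped processor: its cell stays 0, so a static assignment
# cannot by itself guarantee that every cell is written.
holed = write_all_static(8, alive=set(range(8)) - {3})
```

Any fault-tolerant solution therefore has to reassign surviving processors to the cells that remain unwritten.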
The iterated Write-All paradigm is employed independently by Kedem et al. [25] and Shvartsman [42] to extend the results of [21] to arbitrary PRAM algorithms (subject to fail-stop errors without restarts). In addition to the general simulation technique, [25] analyzes the expected behavior of several solutions to Write-All using a particular random failure model. A deterministic optimal work execution of PRAM algorithms is presented in [42]. The optimality is achieved in the presence of worst case failures given parallel slackness (as in [46] by Valiant).
Despite the existence of optimal Write-All algorithms and N-processor PRAM simulations [42] that use specific ranges of fault-prone processors, e.g., $1 \le P \le N/\log^2 N$, it was shown in [21] that no optimal solutions for the Write-All problem exist that use the range of processors $1 \le P \le N$. The strongest known lower bound for Write-All is $N + \Omega(P \log N)$, where $P \le N$, shown by Kedem et al. [26] for a fail-stop no-restart model.
A simple randomized algorithm that serves as a basis for simulating arbitrary PRAM algorithms on an asynchronous PRAM is presented by Martel et al. [34]. This randomized asynchronous simulation has very good expected performance for the Write-All problem when the adversary is off-line. Kedem et al. [26] show an $O\left(\frac{N \log^2 N}{\log \log N}\right)$ deterministic work upper bound on Write-All for

We also demonstrate a lower bound of $N + \Omega(P \log N)$ (when $3 \le P \le N$) for the asynchronous PRAM, when certain atomic primitives (such as compare-and-swap or test-and-set) are used to access shared memory. Note that even given the lower bound of Kedem et al. [26], our lower bound results are still of interest because: (a) they demonstrate that any improvement to the lower bound must take account of the fact that processors can read only a constant number of cells in constant time, (b) they present a simple processor allocation strategy that we use to advantage in Section 4, and (c) the proofs are simpler to understand and use only first principles.
In Section 4 we present three efficient algorithms for the Write-All problem. The first (algorithm V) is a modification of the algorithm of Kanellakis and Shvartsman [21] for the fail-stop no-restart model, and runs on the restartable fail-stop model with completed work $O(N + P \log^2 N + M \log N)$, where $M$ is the number of failures. This algorithm is based on an analysis of the lower bounds in Section 3. The second (algorithm X) runs on both models in time $O(N \cdot P^{\log_2 \frac{3}{2}})$. The third (algorithm T) runs on both models in the case $P = 3$, using $N + O(\log N)$ compare-and-swap operations on the asynchronous model and $N + O(\log N)$ update cycles in the fail-stop restart model. This matches the lower bound when three processors are used.
In Section 5, we show how to use algorithms V and X to simulate any N-processor PRAM on a restartable fail-stop P-processor CRCW PRAM. A terminating execution of each simulated N-processor step is guaranteed with $O(\log^2 N)$ overhead ratio, and (sub-quadratic) completed work $O(\min\{N + P \log^2 N + M \log N,\ N \cdot P^{\log_2 \frac{3}{2}}\})$, where $M$ is the number of failures during the simulation of the particular step. The strategy is work-optimal when the number of simulating processors is $P \le N/\log^2 N$ and the total number of failures in each simulated step is $O(N/\log N)$.
The lower bounds presented in Section 3 apply to the worst-case work of deterministic algorithms and to the expected work of randomized and deterministic algorithms. Randomization does not seem to help, given on-line (non-prespecified) patterns of failures. For example, it is easy to construct on-line failure and restart (resp. no-restart) patterns that lead to exponential (resp. quadratic) in $N$ expected performance for the algorithms presented in [34]. These stalking adversaries are described in Section 6, where we also conclude with some open problems.
Preliminary versions of this work were reported in [7, 22].
Motivation and relation to physical systems
The models we present and study are intended to capture certain features of actual systems.
Processor delay and failure: Processor delay is a feature of any multi-user environment, in which processing priorities are not specified by a single user. Processing time may be unexpectedly required by another user or by the underlying system. Processor failure may occur either because of a physical fault or because another entity in the system preempts processing time without saving the old state.
Communication delay and failure: Communication delay is a well-known feature of multiprocessor systems. Small communication delays are compatible with synchronization if the step time is sufficient for the longest possible access time, but synchronizing by counting up to the longest possible access time eliminates any advantages due to caching and similar techniques. Communication failure may be due to memory operations of other processors. If the communication network reports the failure of an operation, the processor can re-attempt the access, and the situation can be modelled as a communication delay. If unannounced failures can occur, an algorithm must either check its write operations or ensure that omission of a write is not detrimental to performance. For the purposes of accounting, we treat delay and/or failure as occurring to the processors only. If memory operations are atomic and serializable, they may be assumed to be instantaneous, and the communication delays or access failures may be charged to the processor. The model allows communication delay and other latencies, even though it does not make explicit mention of them.
An architecture for a restartable fail-stop multiprocessor: The abstract model that we are studying can be realized in the architecture in Figure 1. This architecture is more abstract than, for example, a realization in terms of hypercubes, but it is simpler to program in. Moreover, basic fault-tolerant technologies (as described in surveys [11, 18, 19]) contribute towards concrete realizations of its components.
1. There are $P$ fail-stop processors (see [40]), each with a unique address and some local memory.
2. There are $Q$ shared memory cells; the input of size $N \le Q$ is stored in shared memory. These semiconductor memories can be manufactured with built-in fault tolerance using replication and coding techniques without appreciably degrading performance [39].
3. Processors and memory are interconnected via a synchronous network (e.g., as in the Ultracomputer [41]). A combining interconnection network that is well suited for implementing synchronous concurrent reads and writes is studied in [27] (the combining properties are used in their simplest form only to implement concurrent access to memory). The network can be made more reliable by employing redundancy [1].
With this architecture, our algorithmic techniques become applicable; i.e., the algorithms and simulations we develop will work correctly, and within the claimed complexity bounds (under the uniform cost memory access assumption) when the underlying components are subject to the failures within their respective design parameters. For the processors, we allow any dynamic pattern of fail-stop failures and restarts.
2 Models of computation
The restartable fail-stop CRCW PRAM
We use as a basis the PRAM model of Fortune and Wyllie [14], where all concurrently writing processors write the same value (common CRCW). Processors are subject to stop failures and restarts as in [40]. Our algorithms are described using the forall/parbegin/parend parallel construct.
1. There are $P$ synchronous processors. Each processor has a unique permanent identifier (pid) in the range $0, \ldots, P-1$, and each processor has access to $P$ and its own pid.
2. The global memory accessible to all processors is denoted as shared; in addition, each processor has a constant size local memory denoted as private. All memory cells are capable of storing $\Theta(\log \max\{N, P\})$ bits on inputs of size $N$.
3. The input is stored in $N$ cells in shared memory, and the rest of the shared memory is cleared (i.e., contains zeroes). The processors have access to the input and its size $N$.
In all our algorithms:
The PRAM processors execute sequences of instructions grouped in update cycles. Each update cycle consists of reading a small fixed number of shared memory cells (e.g., 4), performing some fixed time computation, and writing a small number of shared memory cells (e.g., 2).
The parameters of the update cycle, i.e., the number of read and write instructions, are fixed, but depend on the instruction set of the PRAM; see [14] for a typical PRAM instruction set. The values quoted (4 and 2) are sufficient for our exposition. It is an interesting question whether smaller values would suffice to implement efficient algorithms.
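The shape of a single update cycle can be sketched as follows. This is an illustration only: the function `update_cycle`, its `compute` callback and the dictionary of writes are hypothetical names, and the limits are simply the values 4 and 2 quoted above.

```python
MAX_READS, MAX_WRITES = 4, 2   # the values quoted in the text

def update_cycle(shared, read_addrs, compute):
    """Run one update cycle: bounded reads, local compute, bounded writes.

    `compute` is a hypothetical function that maps the list of values
    read to a dict {address: value} describing the writes.
    """
    assert len(read_addrs) <= MAX_READS
    values = [shared[a] for a in read_addrs]      # read phase
    writes = compute(values)                      # fixed-time local work
    assert len(writes) <= MAX_WRITES
    for addr, val in writes.items():              # write phase
        shared[addr] = val
    return 1   # one unit is charged only because the cycle completed

shared = [0, 0, 5, 7]
charged = update_cycle(shared, [2, 3], lambda v: {0: v[0] + v[1]})
```

A processor that is stopped anywhere before the last write would return nothing and, under the accounting used in this paper, would not be charged.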
We use the fail-stop with restart failure model, where time instances are the PRAM synchronous clock-ticks:
1. A failure pattern $F$ (i.e., failures and restarts) is determined by an on-line adversary that knows everything about the algorithm and is unknown to the algorithm. At any point during the computation, the adversary knows the state of the computation and the contents of the shared and private memories, and it can determine what instructions are being executed or about to be executed by the individual processors.
2. Any processor may fail at any time during any update cycle, or having failed it may restart resynchronized with other processors, provided that: (i) at any time at least one processor is executing an update cycle that successfully completes; (ii) single bit writes are atomic, i.e., failures can occur before or after a write of a single bit.
3. Failures do not affect the shared memory, but the failed processors lose their private memory.
Processors are restarted at their initial state with their pid as their only knowledge.
The failure and restart patterns are syntactically defined as follows:
Definition 2.1 A failure pattern $F$ is a set of triples $\langle \mathit{tag}, \mathit{pid}, t \rangle$ where tag is either failure, indicating processor failure, or restart, indicating a processor restart, pid is the processor identifier, and $t$ is the time indicating when the processor stops or restarts. The size of the failure pattern $F$ is defined as the cardinality $|F|$. □
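Definition 2.1 translates directly into a small data structure. The sketch below is illustrative: the names `Event`, `size` and `failed_at` are hypothetical, and it assumes that an event takes effect at its own time stamp.

```python
from typing import NamedTuple

class Event(NamedTuple):
    tag: str   # "failure" or "restart"
    pid: int   # processor identifier
    t: int     # clock tick at which the processor stops or restarts

# A hypothetical pattern: processor 2 fails at t=5 and restarts at t=9,
# and processor 0 fails at t=7.
F = {Event("failure", 2, 5), Event("restart", 2, 9), Event("failure", 0, 7)}

def size(pattern):
    """|F|: the cardinality of the failure pattern."""
    return len(pattern)

def failed_at(pattern, pid, t):
    """True if processor `pid` is stopped at time t under `pattern`.

    Assumption of this sketch: the most recent event at or before t
    decides the state, and processors start out running.
    """
    events = sorted((e for e in pattern if e.pid == pid and e.t <= t),
                    key=lambda e: e.t)
    return bool(events) and events[-1].tag == "failure"
```

Storing the pattern as a set matches the definition, so `size(F)` counts failures and restarts together.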
For simplicity of presentation, we assume that the shared memory writes of $O(\log \max\{N, P\})$ bit words are atomic. Algorithms using this assumption can be easily converted to use only single bit atomic writes as in [21].
We investigate two natural complexity measures, completed work and overhead ratio. The completed work measure generalizes the standard Parallel-time × Processors product and the Available Processor Steps ($S$) of [21]. The overhead ratio is an amortized measure.
completing its task on some input data $I$ and in the presence of a failure pattern $F$. If $P_i(F) \le P$ is the number of processors completing an update cycle at time $i$, and $c$ is the time required to complete one update cycle, then we define $S(I, F, P)$ as:
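A reading consistent with the two quantities just introduced charges $c$ for every completed update cycle. The sketch below assumes that the computation terminates at some time $\tau$; the symbol $\tau$ and the exact form of the sum are assumptions made for illustration.

```latex
S(I, F, P) \;=\; c \sum_{i=1}^{\tau} P_i(F)
```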
Update cycles are units of accounting. They do not constrain the instruction set of the PRAM, and failures can occur between the instructions of an update cycle. However, in $S(I, F, P)$ the processors are not charged for the read and write instructions of update cycles that are not completed.
Definition 2.3 A P-processor PRAM algorithm on any input data $I$ of size $|I| = N$, and in the presence of any pattern $F$ of failures and restarts of size $|F| \le M$, uses completed work $S = S_{N,M,P} = \max_{I,F} \{S(I, F, P)\}$, and has overhead ratio $\sigma = \sigma_{N,M,P} = \max_{I,F} \left\{ \frac{S(I, F, P)}{|I| + |F|} \right\}$. □
Consider a definition of total work $S'(I, F, P)$ that also counts incomplete update cycles. Clearly $S'(I, F, P) \le S(I, F, P) + c|F|$. Thus, using $S'$ does asymptotically affect the measure of work (when $|F|$ is very large), but it does not asymptotically affect $\sigma$.
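The claim about the overhead ratio can be checked in one line. In this sketch the overhead ratio is written $\sigma$, and $\sigma'$ (notation introduced here, not in the original) denotes the same ratio computed with $S'$ in place of $S$. Dividing the inequality above by $|I| + |F|$ gives:

```latex
\frac{S'(I,F,P)}{|I|+|F|}
  \;\le\; \frac{S(I,F,P)}{|I|+|F|} + \frac{c\,|F|}{|I|+|F|}
  \;\le\; \frac{S(I,F,P)}{|I|+|F|} + c
```

Taking the maximum over $I$ and $F$ yields $\sigma' \le \sigma + c$, so the two ratios differ by at most the additive constant $c$.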
One might also generalize the overhead ratio as $\frac{S(I,F,P)}{T(|I|)+|F|}$, where $T(|I|)$ is the time complexity of the best sequential solution known to date for the particular problem at hand. For the purposes of this exposition, it is sufficient to express $\sigma$ in terms of the ratio $\frac{S(I,F,P)}{|I|+|F|}$. This is because for

Work vs. overhead ratio: For arbitrary processor failures and restarts, the completed work measure $S$ (or the total work $S'$) depends on the size $N$ of the input $I$, the number of processors $P$, and the size of the failure pattern $F$. The ultimate performance goal for a parallel fault-tolerant algorithm is to perform the required computation at a work cost as close as possible to the work performed by the best sequential algorithm known. Unfortunately, this goal is not attainable when an adversary succeeds in causing too many processor failures during a computation.
Example A: Consider a Write-All solution, where it takes a processor one instruction to recover from a failure. If an adversary uses a failure pattern $F$ with the number of failures and restarts $|F| = \Omega(N^{1+\varepsilon})$ for $\varepsilon > 0$, then the completed work will be $\Omega(N^{1+\varepsilon})$, and thus already non-optimal and potentially large, regardless of how efficient the algorithm is otherwise. Yet the algorithm may be extremely efficient, since it takes only one instruction to handle a failure. However, when $F$ can be large relative to $N$ and $P$ (as is the case when restarts are allowed), $\sigma$ better reflects the efficiency of a fault-tolerant algorithm. Recall that $\sigma$ is insensitive to the choice of $S$ or $S'$, and to using update cycles, as a measure of work. However, update cycles are necessary for the following two reasons.
Update cycles and termination: Our failure model requires that at any time, at least one processor is executing an update cycle that completes. (This condition subsumes the condition of [21] that one processor does not fail during the computation.) This requirement is formulated in terms of update cycles and assures that some progress is made. Since the processors lose their context after a failure, they have to read something to regain it. Without at least one active update cycle completing, the adversary can force the PRAM to thrash by allowing only these reads to be performed. Similar concerns are discussed in [40].
Update cycles as a unit of accounting: In our definition of completed work we only count completed update cycles. Even if the progress and termination of a computation are assured (by always completely executing at least one update cycle), when the processors are charged for incomplete update cycles the work $S'$ of any algorithm that simulates a single N-processor PRAM step is at least $\Omega(P \cdot N)$. The reason for this quadratic behavior in $S'$ is the following simple and rather uninteresting thrashing adversary.
Example B: We evaluate the work of any solution for the Write-All problem under the arbitrary failure and restart model. Consider the standard PRAM read-compute-write cycle (if processors begin writing without reading, a simple modification of the argument leads to the same result). A thrashing adversary allows all processors to perform the read and compute instructions; then it fails all but one processor for the write operation. Failed processors are then restarted. Since one write operation is performed per cycle, $N$ cycles will be required to initialize $N$ array elements. Each of the $P$ processors performs $\Omega(N)$ instructions, which results in work of $\Omega(P \cdot N)$. □

By charging the processors only for the completed fixed-size update cycles we do not charge for thrashing adversaries. This change in cost measure allows sub-quadratic solutions.
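The effect of the thrashing adversary of Example B on the two cost measures can be seen in a small simulation. This is a sketch under stated assumptions: reads and computes are counted as two instructions per processor per cycle, the surviving write always hits an unwritten cell, and one completed update cycle costs one unit. The function name `thrash` is hypothetical.

```python
def thrash(n, p):
    """Simulate the thrashing adversary on a read-compute-write solution.

    Each cycle, all p processors read and compute, then the adversary
    lets exactly one processor write and fails the rest, restarting
    them for the next cycle.
    Returns (completed_work, total_instructions).
    """
    x = [0] * n
    completed = 0        # S: completed update cycles only
    instructions = 0     # S': every executed instruction is charged
    while 0 in x:
        instructions += 2 * p     # reads and computes by all processors
        x[x.index(0)] = 1         # the single surviving write
        instructions += 1
        completed += 1            # only the survivor completes its cycle
    return completed, instructions

s, s_prime = thrash(n=64, p=16)
```

With these parameters the completed work stays at N while the instruction count grows with the product of P and N, which is the gap that the update cycle accounting removes.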
The Asynchronous PRAM
The asynchronous PRAM model departs from the standard PRAM models in that the processors are completely asynchronous. The only synchronizing assumption is that reads and writes to memory are atomic and serializable, in the sense of Lamport [28]. Serializability means that the result of a computation is consistent with some total ordering of atomic actions. (Note that this does not mean that the actions are in fact ordered this way, but that the effect of the computation is as if they were.) This is a restriction on the possible outcome of simultaneous events. With asynchronous processors, the distinction between exclusive writes and concurrent writes disappears. Among the traditional synchronous PRAM models, the arbitrary CRCW PRAM is closest to the asynchronous model.
One important situation that is modelled by the asynchronous PRAM is the case in which the processors are "nearly synchronous." If identical processors access shared memory across a common communication channel or network, then they will run at approximately the same speed, but the precise interleaving of memory operations may not be under the direct control of the processors. To model the lack of control over the interleaving, we posit an on-line adversary that chooses the interleaving to maximize the cost of the computation. At any point in time the adversary knows the state of the computation and the contents of all memory locations, and it is free to delay any processor for any length of time.
Definition 2.4 We define an interleaving to be a sequence of processor numbers, each in the range $[0, P-1]$. An execution of a PRAM algorithm consistent with a particular interleaving is the execution of steps by the processors in the order specified by the interleaving. □
The measure of the efficiency of an asynchronous PRAM is the total number of steps completed, which we term the total work of the computation (expressed in terms of $P$ and the input size $N$).
To define total work, we assume that each processor executes a halt instruction when it terminates work on the algorithm. In order for the algorithm to be correct, it must be the case that at this point, the postconditions for the algorithm are satisfied. It is the responsibility of the algorithm to ensure that once a single processor has halted, no other processor takes action that de-establishes the postcondition.
Definition 2.5 The total work of an algorithm with respect to a given interleaving is the length of the smallest halt-free prefix of that interleaving. The total work required by an algorithm is then the maximum total work over all possible interleavings of the processors. (Note that in this worst case, all processors will be ready to execute halt instructions.) □
Previous work along these lines has assumed either that randomized algorithms can be used to defeat off-line adversaries ([34]) or that interleavings are chosen according to some probabilistic distribution ([9, 35]). Some of the models in these last two papers are similar to our restartable fail-stop model, but failures are probabilistic and restarts do not destroy private memory. Because of our worst-case assumptions, these analyses are inappropriate. Furthermore, notions of time used in [9] do not work here, because our scheduling adversary may introduce arbitrarily long delays.
The notion of wait-free asynchronous computation, in which any one processor terminates in a finite number of steps regardless of the speeds of the other processors, is introduced in [16]. In the asynchronous PRAM, by definition any algorithm with bounded work must be wait-free. The same paper shows that atomic reads and writes are insufficient to solve two-processor consensus, and demonstrates a hierarchy of stronger primitives for accessing memory (such as test-and-set or compare-and-swap). A later paper ([5]) demonstrates wait-free data structures using only atomic reads and writes.
Finally, we note that the asynchronous model is a very general one, and it is subject to fewer definitional restrictions than is its fail-stop restartable counterpart. However, as a result of such restrictions, the fail-stop model can be used for efficient general deterministic simulations of synchronous PRAM (as we show in Section 5). It does not appear to be the case that efficient deterministic simulations are possible in the asynchronous model. When randomization is used, it is possible to construct efficient simulations for off-line adversaries as recently shown by Kedem et al. [24]. When asynchronous processors also have initial private data, the computational capability of the model is further moderated by the asynchronous consensus impossibility results [10, 16, 30].
Comparison of the models
On the surface, the two models of restartable fail-stop processors and of asynchronous processors are designed for quite different situations. The fail-stop model treats failure as an abnormal event, which occurs with sufficient frequency that it cannot be ignored. The asynchronous model treats delay as a normal occurrence. Nevertheless, the two models are closely related.
Consider an execution of an asynchronous algorithm. Because the events are serializable, we may assume without loss of generality that the events occur at discrete times. In other words, a set of time slices is fixed in advance, and the scheduling adversary chooses at each time slice whether or not each processor will start running during that time slice. From this viewpoint, the two models differ in the following ways.
1. Processors that miss a time slice lose their internal state in the restartable fail-stop case, and keep their internal state in the asynchronous case.
2. The adversary can stop a processor after any memory operation within a time slice in the restartable fail-stop case, while this has no effect on the asynchronous case.
3. The time slices are long enough for several memory operations in the restartable fail-stop case but allow only a single operation in the asynchronous case.
From the algorithmic point of view, the difference between the models concerns the number of failures during an execution of the algorithm. In the restartable fail-stop model, failure is treated as a significant event, and the number of failures may be taken into account when measuring the efficiency of the algorithm. In the asynchronous model, delay is the rule rather than the exception, and the number of delays is not a particularly meaningful quantity. A normal execution may involve many delays of each processor between each consecutive step.
An algorithm that performs a bounded amount of work for any number of failures, and has a small amount of state information, is suitable for either model. An algorithm whose performance degrades significantly as the number of failures increases, however, may only be suitable for the restartable fail-stop model. Algorithms W and V (as presented in Section 4) are examples of the latter case; algorithms X and T exemplify the former case.
Lower bounds for the Write-All problem
Here we show that up to a logarithmic overhead in work will be required by any Write-All algorithm in the models we consider. A stronger result was given by Kedem et al. [26], who showed similar lower bounds but for a more constrained (fail-stop no-restart) model. The bound in [26] can also be extended to test-and-set operations. The results in this section are of interest for various reasons.
The analysis of algorithm V in Section 4 uses the bounds shown in Theorems 3.1 and 3.3. We use less constrained models, and the lower bounds stand even if processors are allowed to read the entire shared memory in unit time. Finally, our proofs are much simpler: they use only first principles and require no additional machinery.
Lower bounds with memory snapshots
As we have shown in Example B in Section 2.1, without the update cycle accounting there is a thrashing adversary that exhibits a quadratic lower bound for the Write-All problem in the restartable fail-stop model. With the update cycle accounting and for the asynchronous model, we show $N + \Omega(P \log P)$ work lower bounds (when $P \le N$) for both models, even when the processors can take unit time memory snapshots, i.e., processors can read and locally process the entire shared memory at unit cost.

Theorem 3.1 Given any P-processor CRCW PRAM algorithm that solves the Write-All problem of size $N$ ($P \le N$), an adversary (that can cause arbitrary processor failures and restarts) can force the algorithm to perform $N + \Omega(P \log P)$ completed work steps.
Proof: Let Z be any algorithm for the Write-All problem subject to arbitrary failure/restarts using update cycles. Consider each PRAM cycle. The adversary uses the following strategy:
Let $U > 1$ be the number of unvisited array elements, i.e., the elements that no processor succeeded in writing to. For as long as $U > P$, the adversary induces no failures. The work needed to visit $N - P$ array elements when there were no failures is at least $N - P$.
As soon as a processor is about to visit the element $N - P + 1$, making $U \le P$, the adversary fails and then restarts all $P$ processors. For the upcoming cycle, the adversary examines the algorithm to determine how the processors are assigned to write to array elements. The adversary then lists the first $\lfloor U/2 \rfloor$ unvisited elements with the least processors assigned to them. The total number of processors assigned to these elements does not exceed $\lceil P/2 \rceil$. The adversary fails these processors, allowing all others to proceed. Therefore at least $\lfloor P/2 \rfloor$ processors will complete this step having visited no more than half of the remaining unvisited array locations.
This strategy can be continued for at least $\log P$ iterations. The work performed by the algorithm will be $S \ge N - P + \lfloor P/2 \rfloor \log P = N + \Omega(P \log P)$. □
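One cycle of the adversary used in this proof can be sketched as follows. The representation is an assumption made for illustration: `assignment` is a hypothetical map from pid to the array index that the processor is about to write, and `processors_to_fail` is a hypothetical name.

```python
from collections import Counter

def processors_to_fail(assignment, unvisited):
    """Pick the processors that the adversary stops in one cycle.

    `assignment` maps pid -> array index about to be written;
    `unvisited` lists the indices that still hold 0.
    The adversary protects the floor(U/2) unvisited cells with the
    fewest assigned processors by failing everyone assigned to them.
    """
    load = Counter(assignment.values())
    # sorted() is stable, so ties keep the order given in `unvisited`.
    by_load = sorted(unvisited, key=lambda cell: load[cell])
    protected = set(by_load[: len(by_load) // 2])
    return {pid for pid, cell in assignment.items() if cell in protected}

# Hypothetical cycle: P = 8 processors, U = 4 unvisited cells.
assignment = {0: 10, 1: 10, 2: 10, 3: 11, 4: 11, 5: 12, 6: 13, 7: 13}
failed = processors_to_fail(assignment, unvisited=[10, 11, 12, 13])
```

In this instance cells 12 and 11 are protected, three processors are stopped, and the five survivors are charged for a cycle that visits only two of the four remaining cells.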
Note that the bound holds even if processors are only charged for writes into the array of size $N$ and do not have to only write the value 1. The simplicity of this strategy ensures that the results hold in the asynchronous model.

Theorem 3.2 Any P-processor asynchronous PRAM algorithm that solves the Write-All problem of size $N$ has total work $N - P + \Omega(P \log P)$.
Proof: Any possible execution of an algorithm on the restartable fail-stop model can be duplicated by an appropriate interleaving on the asynchronous model. The argument in Theorem 3.1 works even if failed processors do not lose local state, and so the same strategy will work in the asynchronous model. □
This lower bound is the tightest possible bound under the assumption that the processors can read and locally process the entire shared memory at unit cost. Although such an assumption is very strong, we present the matching upper bound for two reasons. First, it demonstrates that any improvement to the lower bound must take account of the fact that processors can read only a constant number of cells per update cycle. Second, it presents a simple processor allocation strategy that we use to advantage in algorithm V in Section 4.

Theorem 3.3 If processors can read and locally process the entire shared memory at unit cost, then a solution for the Write-All problem in the restartable fail-stop model can be constructed such that its completed work using $P$ processors on an input of size $N$ is $S = N - P + O(P \log P)$, when $P \le N$.
Proof: The processors follow the following simple strategy: at each step that a processor PID is active, it reads the $N$ elements of the array $x[1..N]$ to be visited. Say $U$ of these elements are still not visited. The processor numbers these $U$ elements from 1 to $U$ based on their position in the array, and assigns itself to the $i$th unvisited element such that $i = \lceil \mathrm{PID} \cdot \frac{U}{P} \rceil$. This achieves load balancing with no more than $\lceil P/U \rceil$ processors assigned to each unvisited element. The reading and local processing is done as a snapshot at unit cost.
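The allocation rule can be sketched in a few lines. The sketch assumes that PIDs run from 1 to P (so that the rank falls in the range 1 to U) and uses 0-based array indices; the function name `assign` is hypothetical.

```python
from collections import Counter

def assign(pid, x, p):
    """Index of the cell that processor `pid` (1..p) targets, or None.

    Uses a snapshot of the whole array x: the U unvisited cells are
    ranked by position and the processor takes the cell of rank
    ceil(pid * U / p).
    """
    unvisited = [k for k, v in enumerate(x) if v == 0]
    u = len(unvisited)
    if u == 0:
        return None
    rank = (pid * u + p - 1) // p      # integer ceil(pid * u / p)
    return unvisited[rank - 1]

# Hypothetical snapshot: 3 unvisited cells shared among P = 8 processors.
x = [1, 0, 1, 0, 0, 1]
targets = [assign(pid, x, p=8) for pid in range(1, 9)]
load = Counter(targets)
```

Here the eight processors spread over the three unvisited cells with loads 2, 3 and 3, which respects the bound of $\lceil 8/3 \rceil = 3$ processors per cell.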
We list the elements of the Write-All array in ascending order according to the time at which the elements are visited (ties are broken arbitrarily). We divide this list into adjacent segments numbered sequentially starting with 0, such that segment 0 contains $V_0 = N - P$ elements, and segment $j \ge 1$ contains $V_j = \left\lfloor \frac{P}{j(j+1)} \right\rfloor$ elements, for $j = 1, \ldots, m$ and for some $m \le \sqrt{P}$. Let $U_j$ be the least possible number of unvisited elements when processors were being assigned to the elements of the $j$th segment. $U_j$ can be computed as $U_j = N - \sum_{i=0}^{j-1} V_i$. $U_0$ is of course $N$, and for $j \ge 1$, $U_j = P - \sum_{i=1}^{j-1} V_i \ge P - \left(P - \frac{P}{j}\right) = \frac{P}{j}$. Therefore no more than $\lceil P/U_j \rceil$ processors were assigned to each element. The work performed by such an algorithm is:
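A hedged sketch of how the segment estimates combine, restricted to the elements in segments 0 through $m$ and counting one unit per completed update cycle (both restrictions are assumptions of this sketch; $S_{0..m}$ is notation introduced here). While segment 0 is being visited more than $P$ elements are unvisited, so each of its elements is charged to at most one processor; for $j \ge 1$ the bound $U_j \ge P/j$ gives $\lceil P/U_j \rceil \le j$:

```latex
S_{0..m} \;\le\; \sum_{j=0}^{m} V_j \left\lceil \frac{P}{U_j} \right\rceil
  \;\le\; (N - P) + \sum_{j=1}^{m} \frac{P}{j(j+1)} \cdot j
  \;=\; N - P + P \sum_{j=1}^{m} \frac{1}{j+1}
  \;=\; N - P + O(P \log P)
```

The last step uses the harmonic sum $\sum_{j=1}^{m} \frac{1}{j+1} = O(\log m)$ together with $m \le \sqrt{P}$.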
Remark: Under the memory snapshot assumption, it can be shown that the $\Omega(N \log N / \log \log N)$ lower bound of Kanellakis and Shvartsman [21] is the best possible bound for failures without restarts. This is done by adapting the analysis of algorithm W by Martel [32]. According to the analysis, the number of "block-steps" of W for $P = N$ is $O(N \log N / \log \log N)$ and each block-step can be realized at unit cost using memory snapshots. A similar situation holds in the asynchronous model.

Theorem 3.4 If processors can read and locally process the entire shared memory at unit cost, then a solution for the Write-All problem in the asynchronous model can be constructed with total work $N - P + O(P \log P)$ using $P$ processors on input of size $N$, for $P \le N$.
Proof: We use the same algorithm as in the previous proof. The proof itself applies to the asynchronous model with the following modifications: (1) one unit of total work is charged for each read and the write that (potentially) follows; (2) as soon as a processor performs a read, it is charged one unit of work; this is done to take care of the situation when a processor performs a write only after all elements in a given segment have been initialized. □
Lower bounds with test-and-set operations
Under certain assumptions on the way that memory is accessed in the asynchronous model, we can prove a different lower bound. Assume for the moment that, instead of atomic reads and writes, memory is accessed by means of test-and-set operations. That is, memory can only contain zeroes and ones, and a single test-and-set operation on a memory cell sets the value of that cell to 1 and returns the old value of the cell. (We will discuss shortly how this assumption can be generalized.)
Theorem 3.5 Any asynchronous PRAM algorithm for the Write-All problem which uses test-and-set as an atomic operation requires $N + \Omega(P \log(N/P))$ total work, for $P \ge 3$.
Proof: Consider the following class of interleavings. A round will be a length of time in which processors take one step each in PID order; formally, it is the sequence of PIDs $\langle 1, 2, \ldots, P \rangle$. We will run the algorithm in phases. To define a phase, suppose that $U$ cells out of the original $N$ remain unset at the beginning of a phase. We imagine running the algorithm in rounds until a collision occurs; that is, until a test-and-set operation is done on a cell that is already set to one.
Suppose this happens in the $t$th round. The actual definition of the phase depends on the nature of the collision; there are two cases.
If the cell involved in the collision was set in this round, then it was initially set by some processor with PID $i$, and set again by some processor with PID $j$. Then to define the phase, we let only processors $i$ and $j$ alternate steps, instead of running all processors; that is, the phase consists of the PIDs $i, j$ repeated $t$ times. A total of $2t$ steps are taken and one of them is wasted work.
On the other hand, if the cell was set in a previous round, then consider the processor with PID $j$ that set it in this round and let only this processor take steps. That is, the phase consists of the PID $j$ repeated $t$ times, for a total of $t$ steps and one wasted step. We now note that $t$ must be at most $\lceil U/P \rceil$, and so a recurrence for the amount of wasted work $W(U)$ is $W(U) \ge 1 + W(U - 2\lceil U/P \rceil + 1)$. By induction, we can show that $W(U) \ge cP \ln(U/2P)$ for a suitable constant $c > 0$; the result follows by noting that unwasted work $N$ is necessary.
The trivial base case of the induction is $U \le 2P$. Now suppose that the inequality $W(x) \ge cP \ln(x/2P)$ holds for all integers $x < U$. By the recurrence and the induction hypothesis, we have $W(U) \ge 1 + cP \ln((U - 2\lceil U/P \rceil + 1)/2P) \ge 1 + cP \ln(U/2P) + cP \ln(1 - 2/P - 1/U)$. It thus suffices to prove $1 + cP \ln(1 - 2/P - 1/U) \ge 0$. But $1 + cP \ln(1 - 2/P - 1/U) \ge 1 + cP \ln(1 - 5/(2P)) \ge 1 + cP(-5/(2P - 5)) \ge 0$. The first inequality is valid because $U > 2P$; the second inequality uses $\ln(1 - z) \ge -z/(1 - z)$, which can be seen by comparing power series; the third inequality is valid for $P \ge 3$ and any choice of $c \le 1/15$. No attempt was made to optimize the constant $c$. □
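The induction above can be checked numerically. The sketch below is an illustration only: it iterates the recurrence with equality (taking $W(U) = 0$ once $U \le 2P$, which is an assumption about the weakest admissible base case) and compares the result with $cP \ln(U/2P)$ for $c = 1/15$. The function names are hypothetical.

```python
import math


def wasted_work(u: int, p: int) -> int:
    """Iterate W(U) = 1 + W(U - 2*ceil(U/P) + 1), with W(U) = 0 once U <= 2P."""
    wasted = 0
    while u > 2 * p:
        ceil_u_over_p = -(-u // p)      # integer ceiling of U/P
        u = u - 2 * ceil_u_over_p + 1   # cells left unset after one phase
        wasted += 1                     # each phase wastes one step
    return wasted


def closed_form(u: int, p: int, c: float = 1 / 15) -> float:
    """The claimed lower bound c * P * ln(U / 2P)."""
    return c * p * math.log(u / (2 * p))
```

For $P = 3$ and $U = 100$ the recurrence passes through $U = 33, 12, 5$, giving three wasted steps, which is above the closed-form value of roughly 0.56; the constant $c = 1/15$ is far from tight for small $P$.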
The argument used in this lower bound can be applied equally well if the atomic operation is compare-and-swap, or to any set of atomic read-modify-write operations where the reads and writes are constrained to be to the same cells. It also applies to atomic read and atomic write, but in this case there is no known matching upper bound, whereas algorithm T (presented in the next section) can match the lower bound (for some choices of atomic operation) in the case $P = 3$. The above proof technique also applies to the fail-stop restartable model, when each update cycle accesses only one array element used by the Write-All problem.
4 Algorithms for the Write-All problem
The original motivation for studying the Write-All problem was that it intuitively captured the essential nature of a single synchronous PRAM step. This intuition was made concrete when it was shown ([25, 42]) how to use any algorithm for the Write-All problem in general PRAM simulations.
This application is discussed in the next section; in this section, we will present new algorithms for the Write-All problem.
In what follows, we assume that the number of array elements N and the number of processors P are powers of 2. Nonpowers of 2 can be handled using conventional padding techniques. All logarithms are base 2.
4.1 Algorithm V: a modification of a no-restart algorithm
Algorithm W of [21] is an efficient fail-stop (no restart) Write-All solution. The algorithm uses two full binary trees as its basic data structures (the processor counting and the progress measurement trees). The algorithm uses an iterative approach in which all active processors synchronously execute the following four phases:
W1: Processors are counted and enumerated using a static bottom-up, logarithmic time traversal of the processor counting tree data structure.
W2: Processors are allocated to the unvisited array locations according to a divide-and-conquer strategy using a dynamic top-down traversal of the progress tree data structure.
W3: Array assignments are done.
W4: Progress is evaluated by a dynamic bottom-up traversal of the progress tree data structure.
This algorithm has efficient completed work when subjected to arbitrary failure patterns without restarts. It can be extended to handle processor restarts by introducing an iteration counter, and having the revived processors wait for the start of a new iteration. However, this algorithm may not terminate if the adversary does not allow any of the processors that were alive at the beginning of an iteration to complete that iteration. Even if the extended algorithm were to terminate, its completed work is not bounded by a function of $N$ and $P$.
In addition, the proof framework of [21] does not easily extend to include processor restarts: the processor enumeration and allocation phases become inefficient and possibly incorrect, since no accurate estimates of active processors can be obtained when the adversary can revive any of the failed processors at any time.
On the other hand, the second phase of algorithm W can implement processor assignment (in a manner similar to that used in the proof of Theorem 3.3) in $O(\log N)$ time by using the permanent processor PID in the top-down divide-and-conquer allocation. This also suggests that the processor enumeration phase of algorithm W does not improve its efficiency when processors can be restarted.
Therefore we present a modified version of algorithm W, which we call V. To avoid a complete restatement of the details of algorithm V, the reader is urged to refer to [21].
V uses the data structures of the optimized algorithm W of [21] (i.e., full binary trees with $N/\log N$ leaves) for progress estimation and processor allocation. There are $\log N$ array elements associated with each leaf. When using $P$ processors such that $P > N/\log N$ on such data structures, it is sufficient for each processor to take its PID modulo $N/\log N$ to assure that there is a uniform initial assignment of at least $\lfloor P/(N/\log N) \rfloor$ and no more than $\lceil P/(N/\log N) \rceil$ processors to a work element. Algorithm V is an iterative algorithm using the following three phases that are based on the phases W2, W3 and W4 of algorithm W.
Processor re-synchronization after a failure and a restart is an important implementation detail. The model assumes re-synchronization on the instruction level, but the processors still need to be synchronized with respect to the phases. One way of realizing processor re-synchronization is through the utilization of an iteration wrap-around counter that is based on the synchronous PRAM clock. If a processor fails, and then is restarted, it waits for the counter wrap-around to rejoin the computation. The point at which the counter wraps around depends on the length of the program code, but it is fixed at "compile time".
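The uniform initial assignment by PID modulo the number of leaves can be illustrated with a short sketch. It assumes PIDs numbered from 0 and leaves numbered from 0; `initial_leaf` and `leaf_loads` are hypothetical names used only for this example.

```python
from collections import Counter


def initial_leaf(pid: int, num_leaves: int) -> int:
    """Leaf initially assigned to processor `pid`: its PID modulo the leaf count."""
    return pid % num_leaves


def leaf_loads(num_procs: int, num_leaves: int) -> list[int]:
    """Number of processors initially assigned to each leaf."""
    loads = Counter(initial_leaf(pid, num_leaves) for pid in range(num_procs))
    return [loads.get(leaf, 0) for leaf in range(num_leaves)]
```

For $N = 16$ the progress tree has $N/\log N = 4$ leaves; with $P = 10$ processors the loads are `[3, 3, 2, 2]`, that is, between $\lfloor 10/4 \rfloor$ and $\lceil 10/4 \rceil$ processors per leaf.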
Analysis of algorithm V :
We now analyze the performance of this algorithm first in the fail-stop, and then in the fail-stop and restart setting. Proof: We factor out any work that is wasted due to failures by charging this work to the failures.
Since the failures are fail-stop, there can be at most P failures, and each processor that fails can waste at most O(log N) steps corresponding to a single iteration of the algorithm. Therefore the work charged to the failures is O(P log N), and it will be absorbed by the rest of the work.
We next evaluate the work that directly contributes to the progress of the algorithm by distinguishing two cases below. In each of the cases, it takes $O(\log\frac{N}{\log N}) = O(\log N)$ time to perform processor allocation, and $O(\log N)$ time to perform the work at the leaves. Thus each iteration of the algorithm takes $O(\log N)$ time. We use the allocation technique of Theorem 3.3, where instead of reading and locally processing the entire memory at unit cost, we use an $O(\log N)$ time iteration for processor allocation. The results of the two cases combine to yield $S = O(N + P \log^2 N)$. □
The above upper bound analysis is tight:
Theorem 4.2 There is a fail-stop adversary that causes the work of algorithm V to be $S = \Omega(P \log^2 N)$ for the number of processors $N/\log N \le P \le N$, and $S = \Omega(N + P \log N \log P)$ for the number of processors $1 \le P \le N/\log N$.
Proof: Consider the following adversary for $P = N/\log N$. At the outset the adversary fail-stops all processors that are initially assigned to, say, the left subtree of the progress tree. This is the only action by the adversary. For iteration $i$ of algorithm V, let the number of unvisited leaves in the progress tree be $U_i$. The $P = N/\log N$ processors (whether dead or alive) will be assigned in a balanced fashion to the left and right segments of the contiguous $U_i$ unvisited elements. Initially, $U_0$ is $N/\log N$, and in each iteration of the algorithm half of the leaves will be visited by the live processors. Therefore the algorithm will terminate in $\log U_0 = \Theta(\log N)$ block-steps after the initial stoppage of the processors by the adversary. Each block-step takes $\Theta(\log N)$ time using the remaining $P/2$ processors. Thus the work is $S = \frac{P}{2}\,\Theta(\log N)\,\Theta(\log N) = \Theta(P \log^2 N) = \Theta(N \log N)$. When $P$ is larger than $N/\log N$, then each leaf is allocated at least $\lfloor P/(N/\log N) \rfloor$ and no more than $\lceil P/(N/\log N) \rceil$ processors. All processors allocated to the same leaf have their PIDs equal modulo $N/\log N$. Therefore the work is increased by at least a factor of $\lfloor P/(N/\log N) \rfloor$ as compared to the case $P = N/\log N$. I.e., $S = \lfloor P/(N/\log N) \rfloor \cdot \Omega(N \log N) = \Omega(P \log^2 N)$. Finally, when $P < N/\log N$, the result follows similarly using the strategy of case (1) of Lemma 4.1. □
The following theorem expresses the completed work of the algorithm in the presence of restarts:
Theorem 4.3 The completed work of algorithm V using $P \le N$ processors subject to an arbitrary failure and restart pattern $F$ of size $M$ is: $S = O(N + P \log^2 N + M \log N)$.
Proof: The proof of Lemma 4.1 does not rely on the fact that in the absence of restarts, the number of active processors is non-increasing. However, the lemma does not account for the work that might be performed by processors that are active during a part of an iteration but do not contribute to the progress of the algorithm due to failures. To account for all work, we are going to charge to the array being processed the work that contributes to progress, and any work that was wasted due to failures will be charged to the failures and restarts. Lemma 4.1 accounts for the work charged to the array. Otherwise, we observe that a processor can waste no more than $O(\log N)$ time steps without contributing to the progress due to a failure and/or a restart. Therefore this amount of wasted work is bounded by $O(M \log N)$. This proves the theorem. (Note that the completed work $S$ of V is small for small $|F|$, but not bounded by a function of $P$ and $N$ for large $|F|$.) □
Corollary 4.4 The completed work of algorithm V using $P \le N/\log^2 N$ processors subject to an arbitrary failure and restart pattern $F$ of size $M \le N/\log N$ is: $S = O(N)$.
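The corollary follows by direct substitution into the bound of Theorem 4.3; the short derivation below spells out the step, assuming the parameter ranges $P \le N/\log^2 N$ and $M \le N/\log N$.

```latex
\begin{align*}
P \log^2 N &\le \frac{N}{\log^2 N}\,\log^2 N = N, \\
M \log N   &\le \frac{N}{\log N}\,\log N = N, \\
S &= O(N + P\log^2 N + M\log N) = O(N + N + N) = O(N).
\end{align*}
```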
Algorithm X: a binary tree algorithm
We present a new algorithm X for the Write-All problem, and show that its completed/total work complexity is $S = O(N \cdot P^{\log\frac{3}{2}})$ using $P \le N$ processors in the restartable fail-stop and the asynchronous models of computation. The important property of X is that it has bounded sub-quadratic completed work; in the restartable fail-stop model, this is independent of the failure pattern. If a very large number of failures occurs, say $|F| = \Omega(N \cdot P^{0.59})$, then the algorithm's overhead ratio becomes optimal: it takes a fixed number of computing steps per failure/recovery.
Like algorithm V, algorithm X utilizes a progress tree of size $N$, but it is traversed by the processors independently, not in synchronized phases. This reflects the local nature of the processor assignment in algorithm X as opposed to the global assignments used in algorithms V and W. Each processor, acting independently, searches for work in the smallest immediate subtree that has work that needs to be done. It then performs the necessary work, and moves out of that subtree when no more work remains. We present the algorithm on the restartable fail-stop model. Thus, the overall memory used is $O(N + P)$ and the data structures are simple.
Control flow: The algorithm consists of a single initialization and of the parallel loop. A high level view of the algorithm is in Figure 2; all line numbers refer to this figure. More detailed code can be found in Appendix A.
The initialization (line 01) assigns the $P$ processors to the leaves of the progress tree so that the processors are assigned to the first $P$ leaves by storing the initial leaf assignment in w[PID]. The loop (lines 02-13) consists of a multi-way decision (lines 03-12). If the current node $u$ is marked done, the processor moves up the tree (line 04). If the processor is at a leaf, it performs work (line 05). If the current node is an unmarked interior node and both of its subtrees are done, the interior node is marked by changing its value from 0 to 1 (line 08). If a single subtree is not done, the processor moves down appropriately (line 09).
For the final case (line 10), the processors move down when neither child is done. This last case is where a non-trivial decision is made. The PID of the processor is used at depth $h$ of the tree node based on the value of the $h$th most significant bit of the binary representation of the PID: bit 0 will send the processor to the left, and bit 1 to the right. Regardless of the decision made by a processor within the loop body, each iteration of the body consists of no more than four shared memory reads, a fixed time computation using private memory, and one shared memory write (see Appendix A for the detailed algorithm). Therefore the body can be implemented as an update cycle. Figure 3 under the leaves of the tree. The diagram illustrates the state of a computation where the processors were subject to some failures and restarts. Heavy dots indicate nodes whose subtrees are finished. The paths being traversed by the processors are indicated by the arrows. Active processor locations (at the time when the snapshot was taken) are indicated by their PIDs in brackets. In this configuration, should the active processors complete the next cycle, they will move in the directions indicated by the arrows: processors 0 and 1 will descend to the left and right respectively, processor 4 will move to the unvisited leaf to its right, and processors 6 and 7 will move up. □
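The multi-way decision described above can be simulated in a few lines. The sketch below is not the paper's pseudocode (Figure 2 and Appendix A are authoritative); it is an illustration under several assumptions: the progress tree is stored in heap order (root at index 1, leaves at indices $N$ to $2N-1$), PIDs run from 0 to $P-1$, each update cycle is atomic, and a seeded random scheduler stands in for the on-line adversary by choosing whose cycle completes next. The name `run_algorithm_x` is hypothetical.

```python
import random


def run_algorithm_x(n: int, p: int, seed: int = 0) -> tuple[list[int], int]:
    """Simulate algorithm X on n = 2**k cells with p <= n processors.

    Returns the written array and the number of completed update cycles.
    """
    log_n = n.bit_length() - 1
    assert n == 1 << log_n and 1 <= p <= n
    x = [0] * n                        # the Write-All array
    d = [0] * (2 * n)                  # progress tree; d[u] == 1 means "done"
    w = [n + pid for pid in range(p)]  # shared position of each processor
    active = list(range(p))
    rng = random.Random(seed)          # stand-in for the adversary's schedule
    work = 0
    while active:
        pid = rng.choice(active)       # this processor completes one cycle
        u = w[pid]
        work += 1
        if d[u] == 1:                  # node done: move up, or halt at root
            if u == 1:
                active.remove(pid)
            else:
                w[pid] = u // 2
        elif u >= n:                   # unvisited leaf: perform the work
            x[u - n] = 1
            d[u] = 1
        else:
            left, right = 2 * u, 2 * u + 1
            if d[left] and d[right]:
                d[u] = 1               # both subtrees done: mark this node
            elif d[right]:
                w[pid] = left          # only the left subtree has work
            elif d[left]:
                w[pid] = right         # only the right subtree has work
            else:                      # neither done: PID bit at this depth
                depth = u.bit_length() - 1
                bit = (pid >> (log_n - 1 - depth)) & 1
                w[pid] = right if bit else left
    return x, work
```

Because the position `w[pid]` lives in shared memory, a restarted processor in this sketch simply resumes from its recorded node, which is why the scheduler alone is enough to model failures and restarts here.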
Analysis of algorithm X:
We begin by showing the correctness and termination of algorithm X in the following simple lemma. Proof: We first observe that the processor loads are localized in the sense that a processor exhausts all work in the vicinity of its original position in the tree, before moving to other areas of the tree. If a processor moves up out of a subtree then all the leaves in that subtree were visited. We also observe that it takes exactly one update cycle to: (i) change the value of a progress tree node from 0 to 1, (ii) move up from a (non-root) node, (iii) move down left, or (iv) move down right from a (non-leaf) node. Therefore, given any node of the progress tree and any processor, the processor will visit and spend exactly one complete update cycle at the node no more than four times.
Since there are $2N - 1$ nodes in the progress tree, any processor will be able to execute no more than $O(N)$ completed update cycles. If there are $P$ processors, then all processors will be able to complete no more than $O(P \cdot N)$ update cycles. Furthermore, at any point in time, there is at least one update cycle that will complete. Therefore it will take no more than $O(P \cdot N)$ sequential update cycles of constant size for the algorithm to terminate.
Finally, we also observe that all paths from a leaf to the root are at least $\log N$ long, therefore at least $\log N$ update cycles per processor will be required for the algorithm to terminate. □
Now we prove the main work lemma. In the rest of this section, the expression "$S_{N,P}$" denotes the completed work on inputs of size $N$ using $P$ initial processors and for any failure pattern. Note that in this lemma we assume $P \ge N$.
For the base case: we have a tree of height 0 that corresponds to an input array of size 1 and at least as many initial processors $P$. Since at least one processor, and at most $P$ processors will be active, this single leaf will be visited in a constant number of steps. Let the work expended be $c'P$ for some constant $c'$ that depends only on the lexical structure of the algorithm. Therefore $S_{1,P} = c'P \le c_1 P \cdot 1^{\log\frac{3}{2}} - c_2 P \cdot 0 - c_3 P$ when $c_1$ is chosen to be larger than or equal to $c_3 + c'$.
Now consider a tree of height $\log N$ ($\ge 1$). The root has two subtrees (left and right) of height $\log N - 1$. By the definition of algorithm X, no processor will leave a subtree until the subtree is marked-one, i.e., the value of the root of the subtree is changed from 0 to 1. We consider the following sub-cases: (1) both subtrees are marked-one simultaneously, and (2) one of the subtrees is marked-one before the other.
Case 1: If both subtrees are marked-one simultaneously, then the algorithm will terminate after the two independent subtrees terminate plus some small constant number of steps $c'$ (when a processor moves to the root and determines that both of the subtrees are finished). Both the work $S_L$ expended in the left subtree and the work $S_R$ in the right subtree are bounded by $S_{N/2,P/2}$. The added work needed for the algorithm to terminate is at most $c'P$, and so the total work is: $S \le S_L + S_R + c'P \le 2S_{N/2,P/2} + c'P \le c_1 P N^{\log\frac{3}{2}} - c_2 P \log N - c_3 P$ for sufficiently large $c_1$ and any $c_2$, depending on $c'$, e.g., $c_1 \ge 3(c_2 + c')$.
Case 2: Assume without loss of generality that the left subtree is marked-one first with $S_L = S_{N/2,P/2}$ work being expended in this subtree.
Any active processors from the left subtree will start moving via the root to the right subtree. The length of the path traversed by any processor as it moves to the right subtree after the left subtree is finished is bounded by the maximum path length from a leaf to another leaf, which is at most $c' \log N$ for a predefined constant $c'$. No more than the original $P/2$ processors of the left subtree will move, and so the work of moving the processors is bounded by $c'(P/2) \log N$.
We observe that the cost of an execution in which $P$ processors begin at the leaves of a tree (with $N/2$ leaves) differs from the cost of an execution where $P/2$ processors start at the leaves, and $P/2$ arrive at a later time via the root, by no more than the cost $c'(P/2) \log N$ accounted for above. (This is because a simulating scenario can be constructed in which the second set of $P/2$ processors, instead of arriving through the root, start their execution with a failure, and then traverse along a path of 1's (if any) in the progress tree, until they reach a 0 node that is either a leaf, or whose descendants are marked.) Having accounted for the difference, we see that the work $S_R$ to complete the right subtree using up to $P$ processors is bounded by $S_{N/2,P}$ (by the definition of $S$, if $P_1 \le P_2$, then $S_{N,P_1} \le S_{N,P_2}$). After this, each processor will spend some constant number of steps moving to the root and terminating the algorithm. This work is bounded by $c''P$ for some small constant $c''$. The total work $S$ is: $S \le S_L + c'\frac{P}{2} \log N + S_R + c''P \le S_{N/2,P/2} + c'\frac{P}{2} \log N + S_{N/2,P} + c''P \le c_1 P N^{\log\frac{3}{2}} - c_2 P \log N - c_3 P$ for sufficiently large $c_2$ and $c_3$ depending on fixed $c'$ and $c''$, e.g., $c_2 \ge c'$ and $c_3 \ge 3c_2 + 2c''$.
Since the constants $c', c''$ depend only on the lexical structure of the algorithm, the constants $c_1, c_2, c_3$ can always be chosen sufficiently large to satisfy the base case and both cases (1) and (2) of the inductive step. This completes the proof of the lemma. □
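The last inequality of Case 2 can be expanded as follows. This derivation is a reconstruction, not text from the paper: it assumes the inductive bound has the form $S_{N,P} \le c_1 P N^{a} - c_2 P \log N - c_3 P$ with $a = \log\frac{3}{2}$, and writes $c'$ and $c''$ for the two lexical constants of Case 2. It uses $(N/2)^{a} = \frac{2}{3} N^{a}$.

```latex
\begin{align*}
S_{N/2,P/2} &\le \tfrac{1}{3} c_1 P N^{a} - \tfrac{1}{2} c_2 P \log N
               + \tfrac{1}{2} c_2 P - \tfrac{1}{2} c_3 P, \\
S_{N/2,P}   &\le \tfrac{2}{3} c_1 P N^{a} - c_2 P \log N + c_2 P - c_3 P, \\
S &\le S_{N/2,P/2} + c' \tfrac{P}{2} \log N + S_{N/2,P} + c'' P \\
  &\le c_1 P N^{a} - c_2 P \log N - c_3 P
     - \tfrac{1}{2} (c_2 - c') P \log N
     - \tfrac{1}{2} (c_3 - 3 c_2 - 2 c'') P .
\end{align*}
```

Both trailing terms are non-positive once $c_2 \ge c'$ and $c_3 \ge 3c_2 + 2c''$, which is exactly the choice of constants named in the proof.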
The quantity P N ) work on the input of size N using P = N processors.
Proof: We can compute the exact work performed by the algorithm when the adversary adheres to the following strategy:
(a) All processors, except for the processor with PID 0 are initially stopped.
(b) The processor with PID 0 will be allowed to sequentially traverse the progress tree starting at the leftmost leaf and finishing at the rightmost leaf. The traversal will be essentially a post-order traversal, except that the processor will not begin at the root of the binary tree, but at the leftmost leaf.
(c) Any processors with PID $\ne 0$ that find themselves at the same leaf as processor 0 are restarted in synchrony with processor 0 and are allowed to traverse the progress tree at the same pace as processor 0 until they reach a leaf, where they are fail-stopped by the adversary.
The computation terminates when all leaves are visited.
Thus the leaves of the progress tree are visited left to right, from leaf number 1 to leaf number $N$. At any time, if $i$ is the number of the rightmost visited leaf, then only the processors with PIDs 0 to $i - 1$ have performed at least one update cycle thus far.
The cost of such strategy can be expressed inductively as follows:
The cost $C_0$ of traversing a tree of size 1 using a single processor is 1 (unit of completed work).
The cost $C_{i+1}$ of traversing a tree of size $2^{i+1}$ is computed as follows: first, there is the cost $C_i$ of traversing the left subtree of size $2^i$. Then, all processors move to the right subtree and participate (subject to failures) in the traversal of the right subtree at the cost of $2C_i$; the cost is doubled, because the two processors whose PIDs are equal modulo $2^i$ behave identically. Thus $C_{i+1} = 3C_i$, and $C_{\log N} = 3^{\log N} = N^{\log 3}$. □
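The recurrence for the adversary's cost can be unrolled directly. The sketch below is only a numerical illustration of the identity $3^{\log N} = N^{\log 3}$; the function name `adversary_cost` is hypothetical.

```python
import math


def adversary_cost(levels: int) -> int:
    """Unroll C_0 = 1, C_{i+1} = C_i + 2*C_i for a tree with 2**levels leaves."""
    cost = 1                           # C_0: one leaf, one processor
    for _ in range(levels):
        left = cost                    # processor 0 alone in the left subtree
        right = 2 * cost               # doubled cost in the right subtree
        cost = left + right
    return cost
```

For $N = 1024$ this gives $3^{10} = 59049$ completed update cycles, compared with $N^2 \approx 10^6$ for the trivial $O(P \cdot N)$ bound with $P = N$.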
Now we show how to use algorithm X with $P$ processors to solve Write-All problems of size $N$ such that $P \le N$. Given an array of size $N$, we break the $N$ elements of the input into $\frac{N}{P}$ groups of $P$ elements each (the last group may have fewer than $P$ elements). The $P$ processors are then used to solve $\frac{N}{P}$ Write-All problems of size $P$ one at a time. We call this algorithm X′, and we will use X′ in the general simulations.
Remark: Strictly speaking, it is not necessary to modify algorithm X for $P \le N$ processors. Algorithm X can be used with $P \le N$ processors by initially assigning the $P$ processors to the first $P$ elements of the array to be visited. It can also be shown that X and X′ have the same asymptotic complexity; however, the analysis of X′ is very simple, as we show below. ). Thus the overall work will be $S = O(\frac{N}{P} S_{P,P}) = O(\frac{N}{P} P^{\log 3}) = O(N \cdot P^{\log\frac{3}{2}})$.
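The exponent arithmetic behind this bound is short enough to spell out. The step below assumes, as in the rest of the section, that logarithms are base 2 and that each of the $N/P$ subproblems of size $P$ costs $O(P^{\log 3})$.

```latex
\[
\frac{N}{P}\cdot P^{\log 3}
  = N \cdot P^{\log 3 - 1}
  = N \cdot P^{\log 3 - \log 2}
  = N \cdot P^{\log\frac{3}{2}} ,
\qquad \log\tfrac{3}{2} \approx 0.585 .
\]
```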
Using the strategy of Lemma 4.7, an adversary causes the algorithm to perform work $S_{P,P} = \Omega(P^{\log 3})$ ... X can be no more than the upper bound on the completed work. This is because at any point in time there is at least one update cycle that will complete. Therefore, for algorithm X′ with $P \le N$, the time is bounded by $O(N \cdot P^{\log\frac{3}{2}})$. In particular, for $P = N$, the time is bounded by $O(N^{\log 3})$. In fact, using the worst case strategy of Lemma 4.7, an adversary can "time share" the completed cycles of the processors so only one processor is active at any given time, with the processor with PID 0 being one step ahead of other processors. The resulting time is then $\Omega(N^{\log 3})$.
In algorithm X, processors work for the most part independently of other processors; they attempt to avoid duplicating already-completed work but do not coordinate their actions with other processors. It is this property which allows the algorithm to run on the asynchronous model with the same work and time bounds. Proof: If we let $S_{N,P}$ be the total work done by algorithm X on a problem of size $N$ with $P$ processors, then $S_{N,P}$ satisfies the same recurrence as given in the proof of Lemma 4.6. The proof, which never uses synchrony, goes through exactly as in that lemma, except that case 1 (where left and right subtrees have their roots marked simultaneously) does not occur. □
The final result of this section is similar to Theorem 4.6:
Algorithm T : a three-processor algorithm
Quite different techniques are necessary when designing a parallel algorithm in which the number of processors is much smaller than the size of the input. The goal in this situation, when the underlying machine is synchronous, is to find a method whose parallel time complexity is at most the sequential time complexity divided by the number of processors plus a small additive overhead; see [3] for an example of such an algorithm. Note that constant factors are important and cannot be hidden in O-notation. When considering algorithms on fail-stop or asynchronous models, the goal is to have the parallel work complexity be equal to the sequential complexity plus small overhead.
For the Write-All problem, it is easy to achieve this goal with two processors. The processor with PID 0 (henceforth, $P_0$) reads and then writes locations sequentially starting at 1 and moving up; processor $P_1$ reads and then writes locations sequentially starting at $N$ and moving down. Both processors stop when they read a 1. The completed work is exactly $N + 1$.
The first non-trivial case is that of three processors. There are two important points to the algorithm we are about to present: the implementation is non-trivial (even though the idea is simple), and the case of four processors is still open. This makes our algorithm interesting.
Here is an intuitive description of a three-processor algorithm. Processor $P_0$ works left-to-right, processor $P_1$ works right-to-left, and $P_2$ fills starting from the middle and alternately expanding in both directions. If $P_0$ and $P_2$ meet, they both know that an entire prefix of the memory cells has been written. Processor $P_0$ then jumps to the leftmost cell not written by itself or $P_2$, and $P_2$ jumps to the new "middle" of unwritten cells. A meeting of $P_1$ and $P_2$ is symmetric. When $P_0$ and $P_1$ meet, the computation is complete. Intuitively, processors can maintain an upper bound on the number of empty cells remaining that starts at $N$ and is halved every time a collision occurs. Thus at most $\log N$ collisions are experienced by each processor. High-level pseudo-code for the algorithm is given in Figure 4.
Implementation of the high-level algorithm requires some form of communication among the asynchronous processors. At a collision, a processor must determine which processor previously wrote the cell. In the case of a collision with $P_2$, a processor must also determine what portion of the array to jump over. This communication may be implemented either by writing additional information to the cells of the array or by using auxiliary variables.
If the array in which processors are writing is also used to hold auxiliary information, implementation is straightforward. When processor $P_2$ writes to a cell at the left (resp. right) end of its area, it writes the location of the next unwritten cell to the right (resp. left
To solve the pure Write-All problem, in which only 1's are written to the array, auxiliary shared variables are required. These variables must be carefully managed to ensure that the processors maintain a consistent view of the progress of the algorithm. Because a processor may be delayed between reading an auxiliary variable and writing to the array, complete consistency is impossible. Approximate consistency is sufficient, however, if the processors are appropriately pessimistic. The precise code is presented and analyzed in Appendix B.
In summary, algorithm T provides the following bounds. In most applications, the array also has room for communication variables, and no auxiliary variables are necessary.
General simulations on restartable fail-stop processors
We now present a major extension to the algorithms presented so far in the restartable fail-stop model. This is an efficient deterministic simulation of any $N$-processor synchronous PRAM on $P$ restartable fail-stop processors ($P \le N$).
We first formally state the main result and then discuss its proof.
Remark: Priority CRCW PRAMs cannot be directly simulated using the same framework, for one of the algorithms used (namely algorithm X in Section 4) does not possess the processor allocation monotonicity property that assures that higher numbered processors simulate the steps of the higher numbered original processors [42].
An approach for executing arbitrary PRAM programs on fail-stop CRCW PRAMs (without restart) was presented independently in [25] and [42]. The execution is based on simulating individual PRAM computation steps using the Write-All paradigm. It was shown that the complexity of solving an $N$-size instance of the Write-All problem using $P$ fail-stop processors is equal to the complexity of executing a single $N$-processor PRAM step on a fail-stop $P$-processor PRAM. Here we describe how algorithms V and X′ are combined with the framework of [25] or [42] to yield efficient executions of PRAM programs on PRAMs that are subject to stop-failures and restarts as stated in Theorem 5.1.
Proof: The executions of algorithms V and X′ can be interleaved to yield an algorithm that achieves the performance as stated. The completed work complexity is asymptotically equal to the minimum of the completed work performed by V and X′. This is because the number of cycles performed by each algorithm in the interleaving differs by at most a multiplicative constant.
The overhead ratio $\sigma$ is directly inherited from algorithm V by the same reasoning because of Definition 2.3 of $\sigma$ and $S$. □
The simulations of the individual PRAM steps are based on replacing the trivial array assignments in a Write-All solution with the appropriate components of the PRAM steps. These steps are decomposed into a fixed number of assignments corresponding to the standard fetch/decode/execute RAM instruction cycles in which the data words are moved between the shared memory and the internal processor registers. The resulting algorithm is then used to interpret the individual cycles using the available fail-stop processors and to ensure that the results of computations are stored in temporary memory before simulating the synchronous updates of the shared memory with the new values. For the details on this technique, the reader is referred to [21, 25, 42]. Application of these techniques in conjunction with the algorithms V and X′ yields efficient and terminating executions of any non-fault-tolerant PRAM programs in the presence of arbitrary failure and restart patterns. ), then $\sigma$ is $O(1)$.
Thus the overhead efficiency of our algorithm actually improves for large failure patterns. These results also suggest that it is harder to deal efficiently with a few worst case failures than with a large number of failures.
Our next corollary demonstrates a non-trivial range of parameters for which the completed work is optimal; i.e., with Corollary 4.4, the work performed in executing a parallel algorithm on a faulty PRAM is asymptotically equal to the Parallel-time × Processors product for that algorithm.
Corollary 5.5 Any $N$-processor, $\tau$-time PRAM algorithm can be executed on a $P \le N/\log^2 N$ processor fail-stop CRCW PRAM, such that when during the execution of each $N$-processor step of that algorithm the total number of processor failures and restarts is $O(N/\log N)$, then the completed work is $S = O(\tau \cdot N)$.
Discussion and Open Problems
We conclude with a brief discussion of open problems and the effects of on-line adversaries on the expected performance of randomized algorithms.
Lower bounds: We have shown an Ω(N log N) lower bound (when N = P) for the Write-All problem in both the restartable fail-stop and the asynchronous models under the assumption that processors can read and locally process the entire shared memory at unit cost. Under this assumption, these are the best possible lower bounds. Can these lower bounds be improved for the fail-stop restartable and the asynchronous models?
for algorithm X of Ω(N log² N / log log N) [29]. As a corollary of this result, the upper bound for algorithm X is no better than the upper bound for algorithm W for the fail-stop no-restart model.
Can algorithm T be generalized to work with more than three processors, or can another (more general) algorithm be found that achieves truly optimal speedup for small numbers of processors?
Model issues: What is the minimum number of reads and writes necessary in an update cycle to ensure efficient algorithms? What is the precise relationship between the complexity of problems (as opposed to algorithms) on the two models presented here? Finally, are there efficient algorithms for important problems that do not come from simulation of synchronous PRAM algorithms?
On randomization and lower bounds: Analyses of randomized solutions for the Write
A Algorithm X pseudocode
Here we give detailed pseudocode for algorithm X on the restartable fail-stop model.
In the pseudocode, the action, recovery end construct of [SS 83] is used to denote the actions and the recovery procedures for the processors. In the algorithm this signifies that an action is also its own recovery action, should a processor fail at any point within the action block.
The notation "PID[log(k)]" is used to denote the binary true/false value of the ⌊log(k)⌋-th bit of the log(N)-bit representation of PID, where the most significant bit is bit number 0, and the least significant bit is bit number log N. Finally, div stands for integer division with truncation.
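The notation can be made concrete with a short sketch. The helper names `floor_log2`, `pid_bit` and `div` are hypothetical, and the width of the binary representation is passed in as a parameter of the sketch; the only convention taken from the text is that bit number 0 is the most significant bit.

```python
def floor_log2(k: int) -> int:
    """Return the floor of log2(k) for an integer k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k.bit_length() - 1


def pid_bit(pid: int, k: int, width: int) -> bool:
    """Value of PID[log(k)]: bit number floor(log2(k)) of pid.

    Bits are numbered from 0 at the most significant end of a
    width-bit representation (width is an assumption of this sketch).
    """
    j = floor_log2(k)
    if j >= width:
        raise ValueError("bit index lies outside the representation")
    return bool((pid >> (width - 1 - j)) & 1)


def div(a: int, b: int) -> int:
    """Integer division with truncation toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


if __name__ == "__main__":
    # PID = 5 is 101 in a 3-bit representation.
    print([pid_bit(5, k, 3) for k in (1, 2, 4)])
    print(div(7, 2), div(-7, 2))
```

Since ⌊log(k)⌋ is constant across each level of a heap-numbered tree, successive levels consult successive bits of PID, starting from the most significant one.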
The action/recovery construct can be implemented by appropriately checkpointing the instruction counter in stable storage as the last instruction of an action, and reading the instruction counter upon a restart. This is amenable to automatic implementation by a compiler. It is possible to perform local optimization of the algorithm by: (i) evenly spacing the P processors N/P leaves apart when P < N, and (ii) using the integer values at the progress tree nodes to represent the known number of descendant leaves visited by the algorithm. Our worst-case analysis does not benefit from these modifications.
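The checkpointing idea can be sketched as follows. This is a minimal model under stated assumptions, not the construct of [SS 83] itself: stable storage is a Python dictionary, a stop failure is modelled by the hypothetical exception `StopFailure`, and each action is assumed to be safe to re-execute, which is what allows an action to serve as its own recovery.

```python
from typing import Callable, Dict, List, Set


class StopFailure(Exception):
    """Models a processor stopping before its action is checkpointed."""


def run_until_failure(actions: List[Callable[[list], None]], shared: list,
                      stable: Dict[str, int], fail_once: Set[int]) -> None:
    # Private memory is lost on restart, so the counter is read from stable storage.
    pc = stable.get("pc", 0)
    while pc < len(actions):
        actions[pc](shared)
        if pc in fail_once:
            fail_once.discard(pc)
            raise StopFailure(pc)  # stopped after writing, before checkpointing
        pc += 1
        stable["pc"] = pc          # checkpoint as the last instruction of the action


def run_with_restarts(actions: List[Callable[[list], None]], shared: list,
                      fail_once: Set[int]) -> int:
    """Restart after every failure; return the number of restarts."""
    stable: Dict[str, int] = {}
    restarts = 0
    while True:
        try:
            run_until_failure(actions, shared, stable, fail_once)
            return restarts
        except StopFailure:
            restarts += 1          # the interrupted action is simply re-executed


def make_write(i: int) -> Callable[[list], None]:
    """Hypothetical idempotent action: write 1 to shared cell i."""
    def write(shared: list) -> None:
        shared[i] = 1
    return write


if __name__ == "__main__":
    x = [0] * 4
    print(run_with_restarts([make_write(i) for i in range(4)], x, {1, 3}), x)
```

Because the counter is advanced only after an action completes, a restart resumes at the start of the interrupted action, never in the middle of it.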
The algorithm can be used to solve Write-All "in place" using the array x[ ] as a tree of height log(N/2) with the leaves x[N/2..N-1], doubling up the processors at the leaves, and using x[N] as the final element to be initialized and used as the algorithm termination sentinel. With this modification, array d[ ] is not needed. The asymptotic efficiency of the algorithm is not affected.
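The in-place layout can be sketched with standard heap indexing. The sketch assumes that N is a power of two with N ≥ 4, that the root of the tree is x[1], and that the children of node k are nodes 2k and 2k+1; these indexing conventions and the helper names are assumptions of the sketch, chosen to be consistent with leaves at x[N/2..N-1].

```python
from typing import Tuple


def floor_log2(k: int) -> int:
    """Return the floor of log2(k) for an integer k >= 1."""
    return k.bit_length() - 1


def is_leaf(k: int, n: int) -> bool:
    """Leaves of the in-place tree occupy x[n/2 .. n-1]."""
    return n // 2 <= k <= n - 1


def children(k: int, n: int) -> Tuple[int, int]:
    """Heap children of internal node k, with the root at index 1."""
    if k < 1 or is_leaf(k, n) or k >= n:
        raise ValueError("k is not an internal node")
    return 2 * k, 2 * k + 1


def sentinel_index(n: int) -> int:
    """x[n] is initialized last and serves as the termination sentinel."""
    return n


if __name__ == "__main__":
    n = 16
    print(children(1, n), children(7, n))
    # Every leaf lies at depth log(n/2), the height of the tree.
    print({floor_log2(k) for k in range(n // 2, n)}, floor_log2(n // 2))
```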
B Algorithm T pseudocode
The code for algorithm T is given in four parts. The shared declaration part in Figure 6 is followed by one part for each of the three processors in Figures 7 and 8 (algorithm T_i for processor P_i). The code given is designed for easy proof of correctness, rather than optimality.
T_0 and T_1 terminate because I_0 increases and I_1 decreases with every loop iteration. T_2 terminates because every loop iteration either increases i or decreases Right2 - Left2. Since any execution of algorithm T is equivalent to some serialized execution, the following lemma implies that all cells of the array x are 1 at termination.
Lemma B.1 Every serialized execution of algorithm T maintains the following invariants.
The variable temp0 holds a value of Mid2 that was valid at some time after the write and before Left2 was increased by a subsequent execution of procedure jumpright. If P_2 had not yet jumped, conditions 8 and 5 imply the preservation of condition 1. Otherwise, P_2 jumped to the left because of a collision with P_1, and the entire array has been written, satisfying all of the invariants.
The case of assignments to I_1 is symmetrical. The assignment Left2 := Mid2 + i is executed only after P_0 has written to cell Mid2 - i, and hence conditions 1, 5 and 6 imply preservation of condition 3. Similarly, Right2 := Mid2 - i
Lemma B.2 Suppose two processors both write to cell k. Then one (or both) of the processors will collide in its next loop iteration.
Proof: One of the two processors must be P_0 or P_1. If it is P_0, then the other will next attempt to write to cell k - 1 and collide. If it is P_1, then the other will next attempt to write to cell k + 1 and collide. (In either case, the collision may involve the third processor.) □
Lemma B.3 There are O(log N) collisions.
Proof: When P_2 jumps, the quantity Right2 - Left2 decreases by a factor of at least 2. Hence P_2 collides at most log N times. Also, P_0 can collide with P_1, and P_1 with P_0, at most once each.
Suppose P_0 collides with P_2 in attempting to write to cell k. Because P_0 did not collide with P_1, P_2 wrote to cell k with some value m in Mid2 and the value m - k in i. If P_2 continues to process, it will collide with either P_0 or P_1 after at most two iterations, when the value of i has become m - k + 2. (The worst case occurs if P_0 and P_2 both write cell k - 1.) Hence the only cells that P_2 writes with m in Mid2 are in the interval [k - 1, 2m - k + 1]. Thus P_0 attempts to write at most four cells in the interval (i.e., cells k - 1, k, 2m - k and 2m - k + 1), and can collide only at the latter three. Therefore, the number of collisions of P_0 with P_2 is at most three times the number of collisions of P_2.
Similarly, the number of collisions of P_1 with P_2 is at most three times the number of collisions of P_2. Hence the total number of collisions is O(log N), as required. □
Proof: The result follows directly from the above discussion. □
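For reference, the bounds used in the proof of Lemma B.3 can be tallied explicitly. The constant obtained below is only what the four stated bounds add up to; the lemma itself claims no more than O(log N).

```latex
\begin{align*}
\#\,\text{collisions}
  &\le \underbrace{\log N}_{\text{collisions of } P_2}
     + \underbrace{2}_{P_0 \text{ with } P_1,\ P_1 \text{ with } P_0}
     + \underbrace{3\log N}_{P_0 \text{ with } P_2}
     + \underbrace{3\log N}_{P_1 \text{ with } P_2} \\
  &= 7\log N + 2 \;=\; O(\log N).
\end{align*}
```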
If the cells of array x can hold arbitrary integer values, then the information communicated by the values of the shared auxiliary variables can be stored directly in the array. Processors P_0 and P_1 write -1 and -2 respectively. Processor P_2 writes the value Mid2 + i when writing to the left of Mid2 and the value Mid2 - i when writing to the right of Mid2. In this case, only private local variables are required.
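One way to read this encoding: in both cases P_2 stores the index of the cell that mirrors the written cell about Mid2, so Mid2 and i can be recovered from a cell's index and its contents. The sketch below illustrates that reading; the helper names are hypothetical, the decoding step is an inference rather than something stated in the text, and cell indices are assumed to be non-negative so that the markers -1 and -2 cannot be mistaken for a value written by P_2.

```python
from typing import Tuple

P0_MARK = -1  # value written by processor P_0
P1_MARK = -2  # value written by processor P_1


def p2_value(mid2: int, cell: int) -> int:
    """Value P_2 stores in `cell`: Mid2 + i to the left, Mid2 - i to the right.

    Both cases equal the mirror image of `cell` about mid2.
    """
    return 2 * mid2 - cell


def decode_p2(cell: int, value: int) -> Tuple[int, int]:
    """Recover (Mid2, i) from a cell index and the value P_2 stored there."""
    if value < 0:
        raise ValueError("cell was written by P_0 or P_1")
    return (cell + value) // 2, abs(value - cell) // 2


if __name__ == "__main__":
    # Mid2 = 10, i = 3: the cell to the left is 7 and receives Mid2 + i = 13.
    print(p2_value(10, 7), decode_p2(7, 13))
    # Mid2 = 10, i = 2: the cell to the right is 12 and receives Mid2 - i = 8.
    print(p2_value(10, 12), decode_p2(12, 8))
```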
