Abstract
Introduction
Most theoretical research on parallel computation has focused on shared-memory models such as PRAM [15] or specific interconnection network models such as the hypercube [23]. In these models the ratio of memory to processors is fairly small (typically $O(1)$). Recently, bridging models such as BSP [30, 31], CGM [7], and LogP [6, 18] have been proposed, where the ratio of memory to processors is non-constant. In particular, the BSP model has attracted considerable attention, and many algorithms for important problems have been designed for it [12, 13, 14, 8].
In BSP and CGM there are $p$ processors, and a memory of $n$ words (usually, $n$ is also the input size) is distributed evenly across the processors.
The BSP and CGM models, as is usually the case, refer to ideal machines where all processors are fully operational. But the increasing complexity of multiprocessor computers makes the machines prone to hardware failures. This necessitates the design of algorithms that are resilient to hardware faults. The most natural solution is to design a general mechanism for simulating an ideal machine on its faulty counterpart.
In this paper we develop general techniques that can efficiently simulate ideal BSP (CGM) computations on BSP (CGM) machines where some processors may be faulty. The faults are deterministic (i.e., worst-case distributions of faults are considered) and static (i.e., they do not change in the course of computation). We assume that a constant fraction of processors are faulty and that a processor that tries to communicate with a faulty processor is notified, at the end of the communication round, that the communication was unsuccessful.
The BSP and CGM models of computation
We first describe the BSP model. CGM differs from BSP only in the way the cost of a computation is calculated. A BSP machine consists of $p$ processor/memory components communicating through some interconnection network. $L$ is the time for a global barrier synchronization and $g$ describes the ratio of computation and communication throughput. The BSP machine proceeds in a sequence of communication/computation rounds (which Valiant calls supersteps). In a single superstep each processor may send or receive $h$ messages (typically, $h = \Theta(n/p)$, where $n$ is the total memory size) and then perform an internal computation on its internal memory. The parameter $h$ is called the bandwidth a processor uses. The internal computation and the messages sent out in a round $t$ can depend only on data locally available before the start of round $t$ (i.e., on messages received in round $t-1$, but not on messages received in round $t$).
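The rule that round $t$ may use only data available before round $t$ can be illustrated with a small sequential simulator. This is a minimal sketch, not part of the paper: the names `run_bsp` and `ring_step` are hypothetical, and the barrier is modelled simply by swapping message buffers at the end of each round.

```python
def run_bsp(p, step, rounds, state):
    """Run `rounds` supersteps on p simulated processors.

    step(i, t, local_state, inbox) must return (new_state, outgoing), where
    outgoing is a list of (destination, message) pairs.
    """
    inboxes = [[] for _ in range(p)]          # nothing was received before round 0
    for t in range(rounds):
        next_inboxes = [[] for _ in range(p)]
        for i in range(p):
            # Processor i sees only the messages delivered at the previous barrier.
            state[i], outgoing = step(i, t, state[i], inboxes[i])
            for dest, msg in outgoing:
                next_inboxes[dest].append(msg)
        inboxes = next_inboxes                # barrier: round-t messages are usable in round t + 1
    return state


def ring_step(i, t, value, inbox):
    """Adopt the value received from the left neighbour, then pass it to the right (p = 4)."""
    if inbox:
        value = inbox[0]
    return value, [((i + 1) % 4, value)]
```

With four processors holding 0, 1, 2, 3, the values are unchanged after the first superstep and have moved one position around the ring after the second, because messages sent in a round are not visible until the next one.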
Each superstep is charged a cost of $\max\{L, x, gh\}$, where $x$ is the maximum number of local computations performed by any processor and $h$ is the maximum number of packets that any processor sends or receives in the superstep. The total cost of an algorithm is the sum of the costs of the supersteps it performs.
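As a worked illustration of the cost accounting, the sketch below sums per-superstep charges. It assumes the charge has the form $\max\{L, x, gh\}$. Some BSP formulations charge $\max\{L, x + gh\}$ instead, which differs by at most a factor of two. The function names are hypothetical.

```python
def superstep_cost(L, g, x, h):
    """Charge for one superstep (assumed form: max of latency, computation, communication)."""
    return max(L, x, g * h)


def total_cost(L, g, supersteps):
    """Sum the charges over a list of (x, h) pairs, one pair per superstep."""
    return sum(superstep_cost(L, g, x, h) for x, h in supersteps)
```

For example, with $L = 100$ and $g = 2$, the three supersteps $(x, h) = (50, 10), (300, 40), (10, 200)$ are dominated by latency, local computation, and communication respectively.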
The CGM model explicitly specifies the parameter $h$, i.e., it is assumed that $h = \Theta(n/p)$. The objective of designing an efficient CGM algorithm is to minimize the local computation time and the number of communication rounds.
It is convenient to assume a global address space containing the memory of all processors. A memory location with address $l$ in $P(i)$ is represented globally by the pair $(i, l)$. For a datum $d$ that is stored at global address $(i, l)$, $r(d)$ (the pointer to $d$) denotes the global address $(i, l)$.
Related works
Fault-tolerant simulation for PRAM with faulty processors was first studied by Kanellakis and Shvartsman [17] (for the deterministic distribution of faults) and Kedem, Palem and Spirakis [20] (for probabilistic faults). They considered fail-stop dynamic faults, i.e., faults may happen at any time during the computation and cause processors to stop until the end of the computation. For the same model of faults Kedem et al. [19] designed fault-tolerant simulations that have constant expected slowdown under some assumptions on the fault probability. Diks and Pelc [9] developed efficient simulation algorithms for EREW PRAM under the probabilistic processor-fault model. Kanellakis, Michailidis and Shvartsman [16] developed an optimal simulation for EREW PRAM when all processor faults are static. Buss et al. [2] developed deterministic PRAM simulations under the model of restartable failures.
For PRAM with faulty memory, Chlebus, Gambin and Indyk [3] designed randomized simulations for both dynamic and static faults, assuming a constant fraction of deterministic memory faults. For static faults, they gave a simulation of an $n$-processor CRCW PRAM on an $(n/\log n)$-processor CRCW PRAM with $O(\log n)$ step overhead, which is preceded by an $O(\log^{3/2} n)$-time preprocessing. They also gave an algorithm for performing the simulation in real time, but in this case the number of processors of the faulty machine grows to $O(n \log n)$. Chlebus, Gasieniec and Pelc [5] presented a deterministic simulation of an $n$-processor EREW PRAM on an $n$-processor EREW PRAM prone to processor and memory failures, with step overhead $O(\log n)$ and preprocessing time $O(\log^2 n)$. Using randomization, Gasieniec and Indyk [10] reduced the number of processors to $n/\log n$ and the preprocessing time to $O(\log n)$ without changing the step overhead, hence making the simulation work-optimal. They also proved the existence of a deterministic algorithm with the same step overhead and preprocessing time. Recently, Chlebus, Gambin and Indyk [4] To the best of our knowledge, there are no previous works on the fault tolerance of CGM.
Our Results
We present general fault-tolerant simulations of computations on BSP and CGM models under static processor faults. We say that a time (or space) bound, which depends on a number $n$, holds with high probability if the probability that the bound holds is at least $1 - n^{-\alpha}$, where $\alpha$ can be any positive constant depending on the constant factor of the bound. The parameter $h$ above is the bandwidth of the simulator, which is set to $n/p$ on CGM, where only the number of communication rounds is important. On BSP, the average bandwidth $h_{av}$ of the simulated algorithm is assumed to be given to the simulator. We will later explicitly consider how $h$ can be selected on BSP to give a minimal slowdown.
Note that the slowdowns that our results achieve are significantly better than the previous ones for BSP [21, 22].
Further, one of our results is a deterministic one, while the previous works are all randomized ones.
In BSP and CGM algorithms, scalability is an important issue, since a scalable algorithm can be applied without modification to a wide range of parallel machines while retaining its efficiency. We explicitly consider the memory requirements of the simulation techniques: they work with asymptotically the same amount of memory as the simulated algorithm, even when only a constant amount of memory is available per processor. Hence, our results are fully scalable over all values of $p$ from $\Theta(1)$ to $\Theta(n)$.
Another issue that should be addressed in fault-tolerance is that of input reconstruction (i.e., recovering the lost parts of the input by decoding). We can modify the technique in [5] so that the input reconstruction can be performed efficiently on BSP and CGM models, but we omit the details in this version.
We now consider the implications of our simulation techniques in practical BSP and CGM computations. In almost all practical BSP and CGM computations, $p \le n^{\epsilon}$ for some constant $0 < \epsilon < 1$. Also, nearly all efficient BSP algorithms have $h = \Theta((n/p)^{\delta})$ for some constant $\delta > 0$. Hence, the preprocessing and communication slowdowns of our techniques turn out to be constants in these practical cases. That is, fault tolerance incurs no asymptotic slowdown in practical cases with our simulation techniques. This implies that efficient BSP and CGM algorithms for sorting [13], list ranking [8], convex hull [14], and so on [14, 12] can be made resilient to a constant fraction of processor faults without any slowdown even when worst-case fault distributions are assumed. To our knowledge, no previous fault-tolerance results on any general model of computation have this strong property.
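The step from these two practical conditions to constant slowdown can be made explicit. The derivation below is a sketch under an assumption: that the preprocessing and communication slowdowns are polynomial in $\log_h p$, where $h$ is the bandwidth. Here $\epsilon$ and $\delta$ are the two constants of the practical case, i.e., $p \le n^{\epsilon}$ and $h = \Theta((n/p)^{\delta})$. Since $n/p \ge n^{1-\epsilon}$, we have $\log h \ge \delta(1-\epsilon)\log n - O(1)$, and therefore:

```latex
\[
\log_h p \;=\; \frac{\log p}{\log h}
\;\le\; \frac{\epsilon \log n}{\delta\,(1-\epsilon)\log n - O(1)}
\;=\; \frac{\epsilon}{\delta\,(1-\epsilon)} + o(1) \;=\; O(1).
\]
```

Any fixed power of $\log_h p$ is then also bounded by a constant.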
Algorithm overview
We first describe our simulations for the CGM model where $h = \Theta(n/p)$. The modifications needed to make the simulations fully general on BSP will be described at the end of this paper. Our simulation for CGM consists of two parts: the preprocessing and the actual superstep-by-superstep simulation of the simulated algorithm. Let $P(i)$, $1 \le i \le p$, denote the $i$-th processor in an ideal CGM machine. The number $i$ will be called the identification of $P(i)$. If $P(i)$ is not faulty, we say that $P(i)$ is active. For simplicity, we assume that all divisions, square roots, and logarithms we use give integers.
In the preprocessing we first identify the number of active processors among $P(i)$, $1 \le i \le p$.
Preprocessing
We describe the preprocessing, which requires $O((\log_h p)^2)$ communication rounds. We say that a group of processors is good if the ratio of active processors in the group is at least . A good group of processors that has  active processors in total is ranked if  is known by all the active processors in the group and each active processor is assigned a unique rank in ½ 
Ranking
The ranking is performed by constructing an $h$-ary tree consisting of all the active processors in $P(1), \ldots, P(p)$.
The ranking consists of $\log_h p$ phases.
The following notation is used for trees. Let $T$ be a tree. $\mathrm{Root}(T)$ denotes the root node of the tree. $\mathrm{Level}(u)$ We describe the first phase. Fix one first-phase group. There are $h$ processors in the group. Each active processor in the group sends a query message to all other processors in the group.
Then all the active processors in a group know which processors in the group are active. An active processor (say, the one with the largest identification) is selected as the leader of the group. If a group is good, the leader of the group forms an $h$-ary tree of height $1$ such that the leader is the root and all other processors of the group are the leaves, and ranks all the processors such that the leader has the largest rank.
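The first phase can be sketched sequentially as follows. This is an illustration only: `first_phase`, the dictionary layout, and the `threshold` parameter (standing in for the ratio that makes a group good) are hypothetical, ranks are assumed to start at 1, and the leaves are taken to be the active non-leader processors.

```python
def first_phase(group_ids, is_active, threshold):
    """Elect a leader and rank the active processors of one group.

    group_ids: identifications of the processors in the group.
    is_active: mapping from identification to True/False.
    threshold: assumed minimum ratio of active processors for a good group.
    """
    # After the query round, every active processor knows the active set.
    active = sorted(pid for pid in group_ids if is_active[pid])
    if not active:
        return None
    leader = active[-1]                       # largest identification
    good = len(active) >= threshold * len(group_ids)
    if not good:
        return {"leader": leader, "good": False, "ranks": {}, "children": []}
    # Ascending order gives the leader the largest rank.
    ranks = {pid: rank for rank, pid in enumerate(active, start=1)}
    children = [pid for pid in active if pid != leader]   # leaves of the height-1 tree
    return {"leader": leader, "good": True, "ranks": ranks, "children": children}
```

For a group with identifications 4 to 7 in which only processor 5 is faulty, processor 7 becomes the leader with rank 3, and processors 4 and 6 become its leaves.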
We now state the phase invariants that hold after each phase.
1. The active processors of a good group ´ µ, ¼
Note that the phase invariants are satisfied after the first phase. A general phase consists of two subphases: the tree-construction subphase and the ranking subphase.
The tree-construction subphase
For simplicity we describe the computation in ´ ½µ Note that ´ ½µ is the union of ´ ½ ½µ ´ ½ µ. Each ´ ½ µ is called a subgroup of ´ ½ ½µ. The tree-construction subphase consists of the following steps. The details are omitted.
It can be shown that the above procedure is implemented on CGM in Ç´ÐÓ Õµ communication rounds and that À Ø´Ì ¼ µ is Ç´ µ.
The ranking subphase
The total number of active processors is counted by ranking all the processors, using a prefix-sums computation on Ì ¼ 
Building a generalized butterfly
We build a generalized butterfly using the active processors. The generalized butterfly we will build is a $(2 \log_h q)$- Now we describe how to form a generalized butterfly network with the active processors. All the trees we will construct from $T$ satisfy Property 1. We will use a procedure that splits a tree $U$ satisfying Property 1 into several trees $U_0, U_1, \ldots$ so that each $U_i$ contains processors with consecutive ranks and satisfies Property 1. The procedure can be easily implemented by generalizing the tree splitting technique in [28] so that the number of communication rounds required is proportional to the height of $U$ and each processor performs communication and local computation proportional to  in each communication round, regardless of . Also, we can implement the procedure so that each $\mathrm{Root}(U_i)$ knows the identifications of $\mathrm{Root}(U_{i \pm 1})$. The details are omitted. From now on, "evenly split a tree" means the application of this procedure.
Split into columns
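Evenly split $T$ into trees $T_0, T_1, \ldots, T_{c-1}$, where $c$ is the number of columns (so that the processors of $\mathrm{Col}(j)$ are in $T_j$). Note that $\mathrm{Root}(T_j)$, $0 \le j < c$, knows the identifications of $\mathrm{Root}(T_{j \pm 1})$. This requires $O(\log_h q)$ communication rounds since $\mathrm{Height}(T)$ is $O(\log_h q)$.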
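The effect of the even split on ranks can be sketched without the tree machinery. The code below is an assumption-laden illustration: ranks are taken to be 0-based, the number of active processors is assumed to be divisible by the number of columns (in line with the convention that divisions give integers), and the rule that the row is the rank within the column is a hypothetical choice. The function names are also hypothetical.

```python
def split_into_columns(num_active, num_columns):
    """Partition ranks 0..num_active-1 into columns of consecutive ranks."""
    rows = num_active // num_columns          # processors per column
    return [list(range(j * rows, (j + 1) * rows)) for j in range(num_columns)]


def position(rank, rows):
    """Return (row, column) of a processor, taking the rank within its column as the row."""
    return rank % rows, rank // rows
```

With 12 active processors and 3 columns, ranks 0 to 3 form column 0, ranks 4 to 7 form column 1, and ranks 8 to 11 form column 2.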
Build row connections
Let denote the forest consisting of Ì , ¼ . We will repeatedly split a forest into several forests (i.e., split the trees in a forest and combine resulting trees into forests) while maintaining the following invariant, which is satisfied by . Given a tree Ì consisting of active processors, the rank of a processor É´ µ within Ì is defined to be the number of processors in Ì having smaller ranks than that of É´ µ. 
Superstep-by-superstep simulation
We already mentioned that the simulation of local computations is easy. The simulation of communications in one superstep can be considered to be a routing problem in the generalized butterfly where each processor sends and receives $h$ messages. We present two routing techniques. One is a deterministic technique and the other is a randomized one. For simplicity, we assume that an active processor simulates one ideal processor rather than a constant number of ideal processors (i.e., at most $h$ messages are sent and received by a processor), because the extension to a constant number of ideal processors is trivial and incurs only constant slowdown. We first describe how to route a single message in the generalized butterfly.
Routing in the generalized butterfly
We describe how the generalized butterfly is used to route messages by showing how one message $m$ originating from an arbitrary processor can be routed to another arbitrary processor. Let 
Deterministic routing
Butterfly routing does not give a good deterministic bound because of a well-known lower bound [23], which implies that there are cases where the number of messages that pass through a processor in S2 is as large as $h\sqrt{q}$ if we route all the messages to their respective target rows.
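The congestion behind this lower bound can be observed on a small example. The sketch below is not the paper's construction: it uses an ordinary binary butterfly with $2^k$ rows rather than the generalized butterfly, one message per row, greedy routing that fixes the row bits from the most significant one, and the bit-reversal permutation as a standard bad case. All function names are hypothetical.

```python
from collections import Counter


def greedy_path(src, dst, k):
    """Rows visited at levels 0..k when row bits are corrected from the most significant one."""
    rows = []
    for level in range(k + 1):
        low_bits = k - level
        # The top `level` bits already match dst; the remaining low bits still come from src.
        row = ((dst >> low_bits) << low_bits) | (src & ((1 << low_bits) - 1))
        rows.append(row)
    return rows


def bit_reverse(x, k):
    """Reverse the k-bit binary representation of x."""
    return int(format(x, f"0{k}b")[::-1], 2)


def max_congestion(k):
    """Largest number of paths through one node when routing the bit-reversal permutation."""
    load = Counter()
    for src in range(1 << k):
        for level, row in enumerate(greedy_path(src, bit_reverse(src, k), k)):
            load[(level, row)] += 1
    return max(load.values())
```

For 16 rows the busiest node carries 4 messages and for 64 rows it carries 8, i.e., the square root of the number of rows, which is the growth that the distance-tree technique is designed to cope with.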
In our situation, however, a processor $Q(i)$ can communicate directly with any processor whose identification is known to $Q(i)$, and thus the messages do not really have to pass through a processor, and it is possible to efficiently route a large number of messages through a processor. The idea is to form a tree (called a distance tree) containing the set of messages that pass through a processor and route only the root of the tree through the processor. Trees must be split and merged because a processor is connected to several processors. A similar idea was used by Chlebus et al. [5] in a different context (in a linked list). We need a nontrivial generalization to the generalized butterfly because of load balancing among the processors.
High-level description of deterministic routing
We give a high-level description without giving the details of the distance tree (i.e., the structure of the tree and how the nodes of the trees are distributed among active processors) we will use. We only mention that a distance tree uses memory linear in the number of messages it contains. For the time being, we will focus on the messages that originate from $\mathrm{Col}(0)$. The processors in $\mathrm{Col}(0)$ will be called the owners, and all the processors (including the owners) will be called routers when they are used to identify the path that messages should follow in S2. The messages originating from other columns will be considered later. We use the following notation. In S2, for a router outside column $0$, the set of messages that come into the router and the set of messages that go out of it are the same. But for a router in column $0$ the two sets are not the same. For the router in row $i$ and column $j$, let $M(i, j)$ denote the set of messages that are routed out of the router in stage S2 of the butterfly routing. For a router in column $0$, the set of messages that come into it as the result of S2 is denoted separately, as is the subset of $M(i, j)$ that is routed to a particular neighboring router. Each of these message sets is stored in its own distance tree.
We describe the deterministic routing stage by stage.
First, S1 is not needed for messages originating from ÓÐ´¼µ since the messages are initially in ÓÐ´¼µ.
The deterministic implementation of S2 consists of a sequence of routing rounds. Initially, each owner constructs the distance tree of its own messages. We maintain the invariant that the processor holding the root of the distance tree of a router knows the identification of that router; this holds initially because each owner holds its entire distance tree. Each routing round consists of two parts. In the first part, the distance tree of each router is split into $\sqrt{h}$ distance trees, one per outgoing connection, and the root of each of them is sent to the router at the other end of the connection (whose identification can be obtained from the current router). In the second part, called the merge part, the split distance trees that are routed to the same router in the next column are merged into the distance tree of that router. The receiving router participates in the construction of the root of the merged tree, so that the invariant above is maintained. After all the routing rounds are complete, every router in column $0$ has the distance tree of the messages that come into it, and S2 is complete.
For S3, fix a row. By the invariant above, the processor $Q(s)$ holding the root of the distance tree of the row's router in column $0$ knows the identification of that router. The deterministic implementation of S3 consists of rounds numbered from $0$, one per column. In round $j$, the identification of the row's router in column $j$ is broadcast (through the tree structure of the distance tree) to all the processors holding the messages in the distance tree. Then, the messages whose target is that router are sent to it using the broadcast identification. For the next round, $Q(s)$ obtains the identification of the router in column $j+1$ using the row connection of the router in column $j$.
Distance tree
The operations we use on a distance tree are as follows. Initially, we sequentially construct the distance trees of the owners, one for each row $i$, $0 \le i < r$. In S2, we split a distance tree into several distance trees and merge at most $\sqrt{h}$ distance trees into one distance tree. In S3, we broadcast a value to the processors holding the nodes of the distance tree. A data structure that supports these four operations (construction, split, merge, and broadcast) while using only linear memory is the compacted trie.
The distance tree of a router is defined to be a compacted trie containing the messages in the corresponding message set. By defining the key (used in the distance tree) of each message $m$ appropriately, we can implement the four operations efficiently. Let $TR'(m)$ be $r(m) \cdot r + TR(m)$. The key of a message $m$ in a distance tree is defined to be a sequence of blocks of $TR'(m)$, the last of which has index $2(\log_h n + \log_h r)$. We use $TR'(m)$ rather than $TR(m)$ because the key of a message $m$ must be unique in a trie. Note that the key of a message $m$ can be represented using a constant number of words.
Since the number of messages in a distance tree can be proportional to $h\sqrt{q}$, which can be larger than $O(n/q)$, the memory each processor has, the nodes of a distance tree are distributed among the owners with the restriction that one node is stored in one processor.
See Fig. 1 for an example distance tree at a router in column $3$. Assume that the edge from $u$ to $v$ contains two blocks of key. The messages $m$ in the subtree under the
[Fig. 1: an example distance tree at a router in column 3.]
Of the four operations, the only difficult operation to implement is the merge operation, which includes a nontrivial load balancing procedure. We just mention that in order to construct each merged distance tree, one for each row $i$, $0 \le i < r$, the merge is performed level by level from the top level of the distance tree, using only two levels at a time.
Pipelining and the cost of deterministic routing
As described, the routing of the messages originating from all columns takes $O((\log_h q)^4)$ communication rounds, since stage S2 of the routing of the messages originating from one column takes $O((\log_h q)^3)$ communication rounds. We use pipelining twice to reduce the number of communication rounds to $O((\log_h q)^2)$.
We first pipeline the merge operations in different routing rounds. We can split a distance tree immediately after its root is constructed because the split operation uses only the root of a distance tree. The merge construction of all the distance trees in column $j+1$ starts simultaneously after the top $3$ levels of the distance trees in column $j$ are constructed. Since the implementation of the merge operation uses only two levels at a time, the merges in different routing rounds do not interfere with each other.
We next pipeline the routing of messages originating from all columns. First, the purpose of S1 is to reach $\mathrm{Col}(0)$, so it can be replaced by the broadcast of the identification of the column-$0$ router in each $\mathrm{row}(i)$. For S2, processors in $\mathrm{Col}(j)$ start S2 by performing routing round $0$ when the processors in $\mathrm{Col}(0)$ perform routing round $j$. In the same way, processors in $\mathrm{Col}(j)$ start S3 when processors in $\mathrm{Col}(0)$ perform round $j$ of S3. We have shown the result R1 for CGM.
Randomized routing
When all the messages are routed simultaneously, there is a well-known randomized solution that requires asymptotically the same number of communication rounds as when only one message is routed. Specifically, we use the following result due to Ranade [24, 25, 26, 27, 23]. Since we can use $O(\sqrt{h})$ messages per communication round and an $O(\sqrt{h})$ buffer for one connection, we can pipeline the routing so that the number of communication rounds needed is reduced by a factor of $\sqrt{h}$. Thus we have the result R2 for CGM.
Simulations for BSP
There are problems in applying the CGM simulation techniques to BSP since the bandwidth $h$, which is fixed to be $\Theta(n/q)$ in CGM, may change from superstep to superstep in BSP. Although most BSP algorithms use fixed values for $h$, they may not be $\Theta(n/q)$. The bandwidth $h$ can even be larger than $n/q$, the memory each processor has [11].
Hence, the first problem is how to set the bandwidth $h$ in the BSP simulation.
We assume that $h_{av}$, which is the average bandwidth of all the supersteps in the simulated algorithm, is given to the simulation. We will set $h$ as large as possible according to the relationship among the values $h_{av}$, $L$, and $n/q$. The preprocessing of the BSP simulation is the same as that of the CGM simulation. The preprocessing overhead will be $O((\log_h p)^2)$ communication rounds with bandwidth $O(h)$. Also, the local computations can be simulated with constant slowdown as in the CGM simulation.
We can show that communications are simulated within the slowdowns claimed in R1 and R2, even when the bandwidth of a superstep differs from $h_{av}$. The details are omitted.
