We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters re ecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity.
Introduction
Parallel processing promises to o er a quantum leap in computational power that is likely to have a substantial impact on various aspects of the computing eld, and that in particular can be exploited to investigate a wide range of what has been called \grand challenge" problems in science and engineering. It is widely recognized 23] that an important ingredient for the success of this technology is the emergence of computational models that can be used for algorithms development and for accurately predicting the performance of these algorithms on real machines. We take a similar view as in 24] in that the computation model should be a \bridging model" that links the two layers of hardware and software. Existing computation models tend to be biased towards one or the other layer, except for very few exceptions. The Bulk Synchronous Parallel (BSP) model advocated by Valiant 24] is one of the few exceptions. In this paper, we introduce a computation model that speci cally attempts to be a bridging model between the shared memory (single address) programming model and the distributed-memory message passing architectures. Distributed memory systems con gured as a single address space are usually referred to as (scalable) shared memory multiprocessors. These machines achieve the scalability of distributed memory architectures and the simple programming style provided by the single address space. Our model can also be used for predicting performance of data parallel algorithms running on distributed memory architectures.
Since a computation model should predict performance on real machines, we start with a discussion on the basis of our measure of communication costs incurred by accessing remote data. As indicated in 8], the hardware organizations of massively parallel processors (MPPs) seem to be converging towards a collection of powerful processors connected by a communication network that can be modeled as a complete graph on which communication is subject to the restrictions imposed by the latency and the bandwidth properties of the network. According to this common organization, the communication between the di erent processors is handled by pointto-point messages whose routing times are controlled by parameters related to the network latency, processor communication bandwidth, overhead in preparing a message, and network capacity. Such a model avoids a description of the exact structure of the network since algorithms exploiting speci c features of the network are less likely to be robust enough to work well on a variety of architectures and to adapt easily to possible future technological changes. However programming the machine at the message-passing level imposes a heavy burden on the programmer and makes algorithms development and evaluation quite complicated. On the other hand, the data-parallel and the shared-memory programming models are appealing in terms of their ease of use and in terms of their close relationship to sequential programming. Both models assume a single address space.
The Block Distributed Memory (BDM) model introduced in the next section captures the performance of shared memory (single address space) algorithms by incorporating a cost measure for interprocessor communication caused by remote memory accesses. The cost is modeled using the latency and the communication bandwidth of each processor. Since a remote memory access involves the transmission of a packet that typically contains a number of consecutive words, our model encourages the use of spatial locality by incorporating a parameter m that represents a cost associated with accessing up to m consecutive words; this cost will be incurred even if a single word is needed. Our model allows the initial placement of input data and includes the memory latency hiding technique of pipelined prefetching. Since we measure the amount of local computation and the amount of communication separately, we are able to normalize the communication cost and drop one parameter so as to make the analysis of the corresponding algorithms simpler. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, the Fast Fourier Transform (FFT) computation, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity.
In the next section, we provide the details of our model while Section 3 describes a collection of algorithms for handling data rearrangements that occur frequently in shared memory algorithms. The load balancing problem is addressed in Section 4 where a communication e cient algorithm is presented, and Section 5 is devoted to the presentation of e cient algorithms for sorting, FFT, and matrix multiplication. Most of the resulting algorithms seem to share a common structure with high-performance algorithms that have been tested on real machines.
The Block Distributed Memory (BDM) Model
Our computation model, the Block Distributed Memory (BDM), will be de ned in terms of four parameters p, , , and m. As we will see later, the parameter can be dropped without loss of generality. The parameter p refers to the number of processors; each such processor is viewed as a unit cost random access machine (RAM). In addition, each processor has an interface unit to the interconnection network that handles communication among the di erent processors. Data are communicated between processors via point-to-point messages; each message consists of a packet that holds m words from consecutive locations of a local processor memory. Since we are assuming the shared memory programming model, each request to a remote location involves the preparation of a request packet, the injection of the packet into the network, the reception of the packet at the destination processor, and nally the sending of a packet containing the contents of m consecutive locations, including the requested value, back to the requesting processor. We will model the cost of handling the request to a remote location ( read or write) by the formula + m , where is the maximum latency time it takes for a requesting processor to receive the appropriate packet, and is the rate at which a processor can inject (receive) a word into (from) the network. Moreover, no processor can send or receive more than one packet at a time. As a result we note the following two observations. First, if is any permutation on p elements, then a remote memory request issued by processor P i and destined for processor P (i) can be completed in +m time for all processors P i , 0 i p ? 1, simultaneously. Second, k remote access requests issued by k distinct processors and destined to the same processor will require k( + m ) time to be completed; in addition, we do not make any assumption on the relative order in which these requests will be completed.
Most current interconnection networks for multiprocessors use several hardware and software techniques for hiding memory latency. In our model, we allow pipelined prefetching for hiding memory latency. In particular, k prefetch read operations issued by a processor can be completed in + km time.
The underlying communication model for BDM is consistent with the LogP and the postal models 8, 13, 5] but with the addition of the parameter m that incorporates spatial locality. However, our model does not allow low-level handling of message passing primitives except implicitly through data accesses. In particular, an algorithm written in our model can specify the initial data placement among the local memories of the p processors, can use the processor id to refer to speci c data items, and can use synchronization barriers to synchronize the activities of various processors whenever necessary. Remote data accesses are charged according to the communication model speci ed above. As for synchronization barriers, we make the assumption that, on the BDM model, they are provided as primitive operations. There are two main reasons for making this assumption. The rst is that barriers can be implemented in hardware e ciently at a relatively small cost. The second is that we can make the latency parameter large enough to account for synchronization costs. The resulting communication costs will be on the conservative side but that should not a ect the overall structure of the resulting algorithms.
The complexity of a parallel algorithm on the BDM model will be evaluated in terms of two measures: the computation time T comp , and the communication time T comm . The measure T comp refers to the maximum of the local computation performed on any processor as measured on the standard sequential RAM model. The communication time T comm refers to the total amount of communication time spent by the overall algorithm in accessing remote data. Our main goal is the design of parallel algorithms that achieve optimal or near-optimal computational speedups, that is, T comp O( Tseq p ), where T seq is the sequential complexity of the problem under consideration, in such a way that the total communication time T comm is minimized.
Since T comm is treated separately from T comp , we can normalize this measure by dividing it by . The underlying communication model for BDM can now be viewed as the postal model 5] but with the added parameter m re ecting spatial locality. Hence an access operation to a remote location takes +m time, and k prefetch read operations can be executed in + km time. Note that the parameter should now be viewed as an upper bound on the capacity of the interconnection network, i.e., an upper bound on the maximum number of words in transit from or to a processor. In our estimates of the bounds on the communication time, we make the simplifying (and reasonable) assumption that is an integral multiple of m.
We believe that locality is an important factor that has to be taken into consideration when designing parallel algorithms for large scale multiprocessors. We have incorporated the parameter m into our model to emphasize the importance of spatial locality. The notion of processor locality also seems to be important in current multiprocessor architectures; these architectures tend to be hierarchical, and hence the latency is much higher for accessing processors that are further up in the hierarchy than those that are \close by". This feature can be incorporated into our model by modifying the value of to re ect the cost associated with the level of hierarchy that needs to be used for a remote memory access. This can be done in a similar fashion as in the memory hierarchy model studied in 2] for sequential processors. However in this paper we have opted for simplicity and decided not to include the processor locality into consideration.
Several models that have been discussed in the literature, other than the LogP and the postal models referred to earlier, are related to our BDM model. However there are signi cant di erences between our model and each of these models. For example, both the Asynchronous PRAM 9] and the Block PRAM 1] assume the presence of a shared memory where intermediate results can be held; in particular, they both assume that the data is initially stored in this shared memory. This makes data movement operations considerably simpler than in our model. Another example is the Direct Connection Machine (DCM) with latency 14] that uses message passing primitives; in particular, this model does not allow pipelined prefetching as we do in the BDM model.
Basic Algorithms for Data Movements
The design of communication e cient parallel algorithms depends on the existence of e cient schemes for handling frequently occurring transformations on data layouts. In this section, we consider data layouts that can be speci ed by a two-dimensional array A, say of size q p, where column i of A contains a subarray stored in the local memory of processor P i , where 0 i p ? 1. A transformation on the layout A will map the elements of A into the layout (A) not necessarily of the same size. We present optimal or near optimal algorithms to handle several such transformations including broadcasting operations, matrix transposition, and data permutation. All the algorithms described are deterministic except for the algorithm to perform a general permutation.
We start by addressing several broadcasting operations. The simplest case is to broadcast a single item to a number of remote locations. Hence the layout A can be described as a one-dimensional array and we assume that the element A 0] has to be copied into the remaining entries of A. This can be viewed as a concurrent read operation from location A 0] executed by processors P 1 ; P 2 ; : : :; P p?1 . The next lemma provides a simple algorithm to solve this problem; we later use this algorithm to derive an optimal broadcasting algorithm. We are now ready for the following theorem that essentially establishes the fact that a k-ary balanced tree broadcasting algorithm is the best possible for k = m + 1 (recall that we earlier made the assumption that is an integral multiple of m). We next establish the lower bound stated in the theorem. Any broadcasting algorithm using only read, write, and synchronization barrier instructions can be viewed as operating in phases, where each phase ends with a synchronization barrier (whenever there are more than a single phase). Suppose there are s phases. The amount of communication to execute phase i is at least + k i m, where k i is the maximum number of copies read from any processor during phase i. Hence the total amount of communication required is at least P s i=1 ( + k i m). Note that by the end of phase i, the desired item has reached at most (k 1 + 1)(k 2 + 1) (k i + 1) remote locations. It follows that, if by the end of phase s, the desired item has reached all the processors, we must have p (k 1 + 1)(k 2 + 1) (k s + 1). The communication time P s i=1 ( + k i m) is minimized when k 1 = k 2 = = k s = k, and hence (k + 1) s p. Therefore s log p log(k+1) and the communication time is at least s( + km) ( + km) logp log(k+1) . We complete the proof of this theorem by proving the following claim.
Claim: +km log(k+1) log( m +2) + m, for any k 1.
Proof of the Claim: Let r = m , f 1 (k) = log(k+1) , f 2 (k) = km log(k+1) , and f(k) = f 1 (k) + f 2 (k) = +km log(k+1) . Then, f 0 (k) = m(k + 1) log(k + 1) ? (log e)( + km) (k + 1) log 2 (k + 1) :
(Case 1) (1 k r + 1): Since f 1 (k) is decreasing and f 2 (k) is increasing in this range, the claims follows easily by noting that f 1 (k) f 1 (r + 1) = log( m +2) and Our next data movement operation is the matrix transposition that can be de ned as follows. Let q p and let p divide q evenly without loss of generality. The data layout described by A is supposed to be rearranged into the layout A 0 so that the rst column of A 0 contains the rst q=p consecutive rows of A laid out in row major order form, the second column of A 0 contains the second set of q=p consecutive rows of A, and so on. Clearly, if q = p, this corresponds to the usual notion of matrix transpose.
An e cient algorithm to perform matrix transposition on the BDM model is similar to the algorithm reported in 8]. There are p ? 1 rounds that can be fully pipelined by using prefetch read operations. During the rst round, the appropriate block of q=p elements in the ith column of A is read by processor P (i+1)modp into the appropriate locations, for 0 i p ? 1. During the second round, the appropriate block of data in column i is read by processor P (i+2)modp , and so on. Proof: For the rst algorithm, we use a k-ary tree as in the single item broadcasting algorithm described in Theorem 3.1, where k = m +1. Using the matrix transposition strategy, distribute the n elements to be broadcast among k processors, where each processor receives a contiguous block of size n k . We now view the p processors as partitioned into k groups, where each group includes exactly one of the processors that contains a block of the items to be broadcast. The procedure is repeated within each group and so on. A similar reverse process can gradually read all the n items into each processor. Each forward or backward phase is carried out by using the cyclic data movement of the matrix transposition algorithm. One can check that the communication time can be bounded as follows. The complexity bounds are guaranteed to hold with high probability, that is, with probability 1 ? n ? , for some positive constant , as long as p 2 < n 6 lnn , where ln is the logarithm to the base e.
The overall idea of the algorithm has been used in various randomized routing algorithms on the mesh. Here we follow more closely the scheme described in 20] for randomized routing on the mesh with bounded queue size.
Before describing our algorithm, we introduce some terminology. Step 1 ] Each processor P j distributes randomly its n p elements into the p slots of the jth column of A 0 .
Step 2 ] Transpose A 0 so that the jth slice will be stored in the jth processor, for 0 j p ? 1.
Step 3 ] Each processor P j distributes locally its cn p elements such that every element of the form (*,i) resides in slot i, for 0 i p ? 1.
Step 4 ] Perform a matrix transposition on A 0 (hence the jth slice of the layout generated at the end of Step 3 now resides in P j ).
end
The next two facts will allow us to derive the complexity bounds for our randomized routing algorithm. For the analysis, we assume that p 2 < n 6 lnn . Proof: The probability that an element is assigned to the jth slice by the end of
Step 1 is 1 p . Hence the probability that cn p 2 elements destined for a single processor fall in the jth slice is bounded by b( cn p 2 ; n p ; 1 p ) since no processor is the destination of more than n p elements. Since there are p slices, the probability that more than cn Remark: Since we are assuming that p 2 < n 6ln n , the e ect of the parameter m is dominated by the bound c n p (as n p > 6p ln n mp, assuming m 6 ln n). 2 
Load Balancing
Balancing load among processors is very important since poor balance of load generally causes poor processor utilization 19]. The load balancing problem is also important in developing fast solutions for basic combinatorial problems such as sorting, selection, list ranking, and graph problems 12, 21] . This problem can be de ned as follows. The load in each processor P i is given by an array A i 0 : n i ? 1], where n i represents the number of useful elements in P i such that max p?1 j=0 fn j g = M and P p?1 j=0 n j = n. We are supposed to redistribute the n elements over the p processors such that n p elements are stored in each processor, where we have assumed without loss of generality that p divides n.
In this section, we develop a simple and e cient load balancing algorithm for the BDM model. Step 1 ] Each processor P i reads the p ? 1 values n 0 ; : : : ; n i?1 ; n i+1 ; : : :; n p?1 held in the remaining processors. This step can be performed in at most 2 + m 2 + p communication time by Lemma 3.2.
Step 2 ] Processor P i performs the following local computations: Remark: The index t is chosen in such a way that, for i < t, processor P i will read n p elements, and for i t, P i will read n p ? m elements. The indices l i and r i will be used in the next step to determine the locations of the n p or n p ? m elements that will be moved to P i . Notice that this step takes O(p) computation time.
Step 3 ] Processor P i reads A l i +1 ; A l i +2 ; : : :; A r i ?1 , and reads appropriate numbers of elements from A l i and A r i respectively.
Remark: This step needs a special attention since there are cases when a set of consecutive processors read their elements from one processor, say P i . Assume that h processors, P i 0 ; P i 0 +1 ; : : :; P i 0 +h?1 , have to read some appropriate elements from P i . Notice that h d M n=p?m e + 1. Then this step can be divided into two substeps as follows: In the rst substep, h ? 1 processors, P i 0 ; P i 0 +1 ; : : :; P i 0 +h?2 , read their elements from each such P i ; this substep can be done in + M communication time by applying Corollary 3.1. In the second substep, the rest of the routing is performed by using a sequence of read prefetch operations since the remaining elements in each processor are accessed only by a single processor. Hence the total communication time required by this step is ( +M)+( + n p ) = 2 + M + n p .
Step 4 ] Processor P i , (i t), reads the remaining m elements from the appropriate processors; the corresponding indices l 0 i and r 0 i can be computed locally as in Step 2. Remark: This step can be completed in ( + m 2 ) communication time since each processor reads its m elements from at most m processors, and these reads can be prefetched. 
Sorting
We rst consider the sorting problem on the BDM model. Three strategies seem to perform best on our model; these are (1) column sort 15], (2) sample sort (see e.g. 6] and related references), and (3) rotate sort 18]. It turns out that the column sort algorithm is best when n 2p(p ? 1) 2 , and that the sample sort and the rotate sort are better when p 2 n < 2p(p ? 1) 2 .
The column sort algorithm is particularly useful if n 2p(p?1) 2 ; it can be implemented in at most (4 + 3 n p + 2pm) communication time with O( n log n p ) computation time. When n < 2p(p ? 1) 2 , the column sort algorithm is not practical since its constant term grows exponentially as n decreases. The sample sort algorithm is provably e cient when p 2 < n 6 lnn ; it can be implemented in (3 +(p?1)d 5 ln n m em+6 n p ) communication time and O( n logn p ) computation time with high probability. The rotate sort algorithm can be implemented in 8( + n p + pm) communication time with O( n log n p ) computation time, whenever n 6p 2 .
We begin our description with the column sort algorithm.
Column Sort
The column sort algorithm is a generalization of odd-even mergesort and can be described as a series of elementary matrix operations. Let A be an q p matrix of elements where qp = n, p divides q, and q 2(p ? 1) 2 . Initially, each entry of the matrix is one of the n elements to be sorted. After the completion of the algorithm, A will be sorted in column major order form. The column sort algorithm has eight steps. In steps 1, 3, 5, and 7, the elements within each column are sorted. In steps 2 and 4, the entries of the matrix are permuted. Each of the permutations is similar to matrix transposition of Lemma 3. Sample Sort
The second sorting algorithm that we consider in this section is the sample sort algorithm which is a randomized algorithm whose running time does not depend on the input distribution of keys but only depends on the output of a random number generator. We describe a version of the sample sort algorithm that sorts on the BDM model in at most (3 + (p ? 1)d 5 lnn m em + 6 n p ) communication time and O( n log n p ) computation time whenever p 2 < n 6 lnn . The complexity bounds are guaranteed with high probability if we use the randomized routing algorithm described in Section 3.
The overall idea of the algorithm has been used in various sample sort algorithms. Our algorithm described below follows more closely the scheme described in 6] for sorting on the connection machine CM-2; however the rst three steps are di erent.
Algorithm Sample Sort Input: n elements distributed evenly over a p-processor BDM such that p 2 < n 6 lnn .
Output: The n elements sorted in column major order. begin
Step 1 ] Each processor P i randomly picks a list of 5 ln n elements from its local memory.
Step 2 ] Each processor P i reads all the samples from all the other processors; hence each processor will have 5p ln n samples after the execution of this step.
Step 3 ] Each processor P i sorts the list of 5p ln n samples and pick (5 ln n + 1)st, (10 ln n + 1)st, : : : samples as the p ? 1 pivots.
Step 4 ] Each processor P i partitions its n p elements into p sets, S i;0 ; S i;1 ; : : : ; S i;p?1 , such that the elements in set S i;j belong to the interval between jth pivot and (j + 1)st pivot, where 0th pivot is ?1, pth pivot is +1, and 0 j p ? 1.
Step 5 ] Each processor P i reads all the elements in the p sets, S 0;i ; S 1;i ; : : :; S p?1;i , by using Algorithm Randomized Routing.
Step 6 ] Each processor P i sorts the elements in its local memory.
end
The following lemma can be immediately deduced from the results of 6].
Lemma 5.1 For any > 0, the probability that any processor contains more than n p elements after Step 5 is at most ne ?(1? 1 ) 2 We now describe the algorithm brie y; all the details appear in 18]. We begin by specifying three procedures, which serve as building blocks for the main algorithm. Each procedure consists of a sequence of phases that accomplish a speci c transformation on the array.
Procedure BALANCE: Input array is of size v w. For the complete correctness proof of the algorithm, see 18]. We can easily prove that this algorithm can be performed in at most 14 local sorting steps within each processor. We can also prove that each of the simple permutations can be done in ( + n p + pm) communication time in a similar way as in Lemma 3.3. Steps 1, 3, 5, and 6 can each be done with one simple permutation. Steps 2 and 4 also can each be done with one simple permutation by overlapping their second permutations with the rst permutations of steps 3 and 5 respectively. Originally, Step 7 is \Repeat SHEAR three times" which is designed for removing six \dirty rows" that are left after Step 5; hence, this step requires 6 simple permutations on our model. Since we assumed n p 6p, the length of each column is larger than that of each row, and we can reduce the number of the applications of SHEAR procedure in Step 7 by transposing the matrix in Step 6. Thus, since the assumption n p 6p implies that there are at most two dirty columns after the execution of Step 6, one application of procedure SHEAR is enough in Step 7 for removing the two dirty columns and we have the following theorem.
Theorem 5.3 Given n 6p 2 elements, n p elements in each of the p processors of the BDM model, the n elements can be sorted in column major order form in 8( + n p +pm) communication time and O( n log n p ) computation time. 2
Notice that if p 2 n < 6p 2 , we need to repeat SHEAR dlog(1 + 6p 2 n )e times at Step 7 for removing the dirty columns, and the communication time for Algorithm ROTATESORT is at most k( + n p + pm), where k = 6 + 2dlog(1 + 6p 2 n )e.
Other Sorting Algorithms
When the given elements are integers between 0 and p O(1) , the local sorting needed in each of the previous algorithms can be done in O( n p ) computation time by applying radix sort. Hence we have the following corollary. Two other sorting algorithms are worth considering: (1) radix sort (see e.g. 6] and related references) and (2) approximate median sort 2, 22] . The radix sort can be performed on our model in O(( b r ) n p ) computation time and ( b r )T comm (n; p) communication time, where b is the number of bits in the representation of the keys, r is such that the algorithm examines the keys to be sorted r-bits at a time, and T comm (n; p) is the communication time for routing a general permutation on n elements (and hence the bounds in the above corollary apply). The approximate median sort, which is similar to sample sort with no randomization used but with p ? 1 elements picked from each processor after sorting the elements in each processor, can be done on our model in O( n logn p ) computation time and in at most 3 + p 2 + 2 n p + T comm (n; p) 
