The Bulk-Synchronous Parallel model of computation has been used for the architecture independent design and analysis of parallel algorithms whose performance is expressed not only in terms of problem size n but also in terms of parallel machine properties. In this paper the performance of implementations of deterministic and randomized BSP sorting algorithms is examined. The deterministic algorithm is the one in [22, 28] that uses deterministic regular oversampling and parallel sample sorting and is augmented to handle duplicate keys transparently with optimal asymptotic efficiency. The randomized algorithm derives from the algorithm in [21] that is sample-sort based and uses oversampling and the ideas introduced with the deterministic algorithm. The resulting randomized design, however, works differently from traditional parallel sample-sort based algorithms and is also augmented to transparently handle duplicate keys with optimal asymptotic efficiency thus eliminating the need to tag all input keys and to double communication/computation time. Both algorithms are shown to balance the work-load evenly among the processors and the use and precise tuning of oversampling that the BSP analysis allows combined with the transparent duplicate-key handling insures regular and balanced communication. Both algorithms have been implemented in ANSI C and their performance (scalability and efficiency issues) has been studied on a distributed memory machine, a Cray T3D. The experimental results obtained from these implementations are reported in this work. The validity of the theoretical model is also tested; based on the theoretical performance of each algorithm under the BSP model and the BSP parameters of a Cray T3D, it is possible to estimate the actual performance of the implementations based on the theoretical performance of the designed algorithms.
Introduction
One of the early attempts to model a parallel computer has been the Parallel Random Access Machine (PRAM) [19] which is one of the most widely studied abstract models of a parallel computer. The advantage of the PRAM model in terms of algorithm design is its simplicity in capturing parallelism and abstracting away the communication requirements of parallel computing such as communication latency, block transfers, memory and network conflicts during routing, bandwidth of interconnection networks, interprocessor communication, memory management and synchronization. In addition, most of the algorithms designed for the PRAM model work under unlimited parallelism assumptions, where the number of processors utilized to solve a problem is a function of problem size. It is these advantages that make the PRAM easy to program and popular for algorithm development. The success of the PRAM model is witnessed by the hundreds of algorithms that have been designed and extensively analyzed on various variants of the PRAM.
In parallel machines that have been built so far, however, communication is expensive, synchronization is not instantaneous and unlimited parallelism is unavailable. The majority of parallel machines and available parallel software do not allow programmers to write programs that are both efficient, portable and scalable; such objectives are only feasible by fine-tuning parallel code and taking into account the finest details of the underlying architecture and the programming interface that is available to the programmer. Parallel algorithm design, on the other hand, usually ignores communication and/or synchronization issues and works only under unlimited parallelism assumptions (or by simulations on limited processor models). Many parallel algorithms are thus designed having in mind factors not present in any reasonable machine, such as zero communication delay or infinite bandwidth.
The need for architecture independent parallel algorithm design is thus immediate for effective and efficient programming of existing and future parallel hardware platforms. Such an algorithm design would work in two steps. First, an abstraction of parallel hardware is obtained to allow the performance of a parallel machine to be described in some general architecture independent terms through a number of parameters that reflect the computation, communication, and synchronization capabilities of the hardware. Then, the performance of a parallel algorithm would be expressed as a function of these parameters and problem size similarly to traditional sequential algorithm design. This way it would be possible to describe the performance of a parallel algorithm not only on existing machines but also on machines not currently built as long as sufficient information on the not-yet-built machines can be obtained (i.e. the parameters of the abstraction become available).
The introduction of realistic parallel computer models such as the Bulk-Synchronous Parallel (BSP) model [65, 66] and the LogP [15] model and their extensions ( [47, 7] ) come to address these limitations of parallel computing. Both models take into reasonable consideration synchronization, communication, and latency issues related to both communication, bandwidth limitations and network conflicts during routing.
A considerable amount of work has been devoted in the study of the BSP model, and the analysis and design of BSP algorithms. In this work, we present parallel implementations, using the Oxford BSP Toolset, BSPlib [43] , of deterministic and randomized BSP sorting algorithms on a distributed memory machine, a Cray T3D. A report is also included on the performance and relative merits of these algorithms and the knowledge gained from the study of the predicted theoretical performance and its comparison to actual running time. A comparison of our algorithm implementations to other parallel sorting algorithm implementations is also presented.
are expressed in terms of n, though it is not assumed so.
The charging policy for local operations on the BSP model for the sorting algorithms is described. Since the mechanisms of the BSP model are similar to those used by sequential models and performance ratios of these two are important, operations can be defined in a higher level than machine level instructions. A dominant term in the BSP algorithms to be described later is contributed by sequential sorting performed independently at each processor. A charge of n lg n time for sorting n keys sequentially [49] , and n lg q time for merging q lists of total size n [49] are claimed. For other operations, O(1) time is charged for operations over an associative operator in parallel-prefix computations and O(1) time for each comparison performed. Finally, ⌈lg n⌉ comparisons are charged for performing binary search in a sorted sequence of length n − 1.
Sorting on the BSP model: An overview
A significant amount of work has been devoted in the study of the BSP model [64, 65, 66, 53, 54, 55, 47, 7] and the analysis [7, 10, 11, 21, 22, 36] and design of BSP algorithms [8, 9, 23, 24, 25, 26, 27, 53, 54, 55] .
The problem of parallel sorting n keys [51] has drawn considerable attention. First approaches were sorting networks introduced by Batcher [5] . Since then, a wealth of literature concerning parallel sorting has been published [50, 51] . The first sorting network to achieve, however, the optimal O(lg n) time bound to sort n keys is the AKS [2] network of width O(n). Subsequently, Reif and Valiant [59] presented a randomized sorting algorithm that runs on a fixed-connection network and achieves the same time bound as the AKS algorithm; the constant multipliers in the randomized algorithm are, however, considerably smaller. Cole [14] was the first to present optimal O(lg n) time sorting algorithms for n-processor CREW and EREW PRAMs. These methods fail, however, to deliver optimal performance when directly implemented on the BSP model of computation.
Note: In the expressions for running time on the BSP model below, terms that involve L are included in the communication -and not in computation -time, to simplify exposition. Polynomial slack would mean that n p = n/p is a constant degree polynomial, i.e. p = n 1−t for some constant t such that 0 < t < 1.
Previously known results on BSP sorting include deterministic [1, 10, 22, 27, 36, 62] and randomized [21, 24, 30] algorithms.
An adaptation of the AKS sorting network [2] on the BSP model is shown in [10] to yield computation and communication time O(n p lg n) and gO(n p lg p) + LO(lg p) respectively. The constant factors hidden in this algorithm, however, is considerably large and not of practical significance.
It is shown in [1] that for all values of n and p such that n ≥ p, a BSP version of column-sort [50] requires O(ζ 3.42 n p lg n p ) and O(gζ 3.42 n p ) + O(Lζ 3.42 ) time for computation and communication respectively, where ζ = lg n/ lg (n/p).
In [10] , it is shown that an adaptation on the BSP model of the cube-sort algorithm [16] requires computation time O (25 lg analysis is discussed in [52] . Given any O(n lg n) worst-case running time sequential sorting algorithm, the randomized BSP algorithm utilizes the sequential algorithm for local sorting and exhibits parallel efficiency which is asymptotically 100% provided that the BSP parameters are appropriately bounded (and these bounds accommodate most, if not all, currently built parallel machines). The constants hidden affect low order terms and are well defined. This has been one of the first attempts to establish in some general way parallel efficiency for an algorithm by taking into consideration communication and synchronization delays and describing optimality criteria not in absolute terms, that may be meaningless, but by comparing the proposed parallel algorithm to the best sequential one. The bounds on processor imbalance during sorting are tighter than any other random sampling/oversampling algorithm [46, 20, 59, 60] . The randomized algorithm has been implemented on realistic machines (SGI Power Challenge, Cray T3D and IBM SP2) and exhibited performance well within the bounds indicated by the theoretical analysis ( [4, 31] ). We note that sample-based sorting algorithms are extensively studied and implemented in the literature [12, 42] . For example, the sample-sort algorithm in [12] is computationally equivalent to the one round case of [21] . The analysis of [21] is however much tighter. It allows better control of oversampling with the end result of achieving smaller bucket expansion (key imbalance during the routing operation). The work of [42] applied to d-dimensional meshes is similar to the general algorithm of [21] .
In [27, 30] , it is shown that asymptotical performance comparable to that of [21] can be achieved for values of p much closer to n, i.e., p = ω(n lg lg n/ lg n) by a new randomized sorting algorithm. The failure probability is, however, significantly lower. As the improvements of this algorithm are interesting only for large p and the corresponding conditions on L and g for large p can be satisfied only by machines with very low L and g (i.e. PRAM-like behaving machines), the practical implications of this algorithm are less significant and interesting than its theoretical merit.
Under the same realistic assumptions as [21] , a new deterministic sorting algorithm is introduced in [22] , and a bound of (1 + (⌊ζ(1 − ζ −1 )⌋((1 − ζ −1 )/2) + o(1))(n lg n/p) and O(gζn p ) + O(L ζ) is shown for computation and communication respectively, thus improving upon the upper bounds of [1, 2, 10, 16, 50] . The bound on computation is subsequently improved in [27, 28] . In particular, for p = n/ lg 2+α n, α = Ω(1), the improved deterministic sorting algorithm in [27, 28] requires computation and communication time (1 + 2/α + o(1)) n lg n/p and O(g ζ n p ) + O(L ζ) respectively. The algorithm performs deterministic sorting by extending regular sampling, a technique introduced in [61] , to perform deterministic regular oversampling. Past results that use regular sampling have been available for cases with p 2 < n. The BSP algorithm in [22, 27, 28] further extends the processor range and achieves asymptotically optimal efficiency for a range of n/p that is very close to that ratio of the randomized BSP algorithms in [21, 30] . This was made possible by a detailed and precise quantification of the input key imbalance among the processors during the phases of deterministic sorting thus contributing to the understanding of regular oversampling. By using regular oversampling in a deterministic setting, it is possible to regulate the oversampling ratio to bound the maximum key imbalance among the processors. As in the case of randomized sorting, the insistence, under an architecture independent algorithm design, of satisfying the criterion of oneoptimality (i.e. optimal speedup/efficiency) led to these improvements. In this work we use the structure of this algorithm to derive a sample-based randomized sorting algorithm that does not follow the standard pattern of sample and splitter select, key routing and local sorting but instead follows the pattern of the deterministic algorithm i.e. local sort, sample and splitter select, key routing and local merging. In [36, 62] independent BSP adaptations of parallel merge-sort [14] are presented that for all values of n and p such that p ≤ n, require computation and communication/synchronization time O(n lg n/p) and O(gζn p ) + O(Lζ) respectively. As the BSP adaptation of parallel merge-sort is at least as involved as the original parallel merge-sort (EREW PRAM variant) algorithm [14] , the constant factors involved in the analysis are considerably large and the algorithm seems to be of little practical use as opposed to the algorithms in [21, 22, 28] .
Contents of the paper
In this work we study an implementation of the deterministic algorithm in [27, 28] that uses deterministic regular oversampling, which extends the notion of regular sampling of [61] , and parallel sample sorting that allows the algorithm to work for a wider range of processor size p. The deterministic algorithm is also augmented to handle transparently duplicate keys with optimal asymptotic efficiency. Our method of duplicate-key handling tags only a fraction o(n) of the n input keys, and does not double communication and/or computation time that other duplicate handling approaches may require [39, 40, 41] . The idea of using deterministic regular-oversampling in a deterministic sorting algorithm and the accurate quantification of its effects by measuring constants under the BSP model, results in a fine and quantifiable balance of computational load and communication time. Combined with our transparent and efficient duplicate key handling it leads to an algorithm that maintains its optimal performance even if all keys are the same.
The basis of the randomized algorithm is the foundation of randomized sample-sort algorithms [46, 20, 59, 60] that use oversampling [59] and in particular the BSP adaptation discussed in [21] combined with the ideas used in the deterministic algorithm. The resulting randomized sorting algorithm whose implementation is studied works differently from other sample-sort based algorithms including the one in [21] . The algorithm first local sorts, then sample and splitter selects, routes keys, and in the very end p-way merges sorted sequences as opposed to traditional sample-sort algorithms that sample and splitter select, route keys, and local sort. Our approach augmented with the transparent and efficient duplicate-key handling also used in the deterministic algorithm results in improved parallel performance, work balance and communication minimization.
Even though the two algorithms look similar, randomized oversampling is provably superior to deterministic regular-oversampling. The oversampling parameter in the former case can vary more widely than the corresponding one of the latter case thus resulting in more balanced communication and balanced work-load among the processors.
Parallel sorting algorithms originally designed for other models of computation (e.g. bitonic-sort [5, 29] ) are also implemented. The implementation platform is a distributed memory machine, a Cray T3D, and the communication library used is the Oxford BSP Toolset [43] .
The introduced algorithms are designed and analyzed in an architecture independent setting. Such a design is helpful in deciding the best way to implement various parallel operations optimally on the given platform based on the knowledge of the BSP parameters of the target platform only. For example, problem size n of the experiments, and machine configuration parameters as expressed in terms of the BSP parameters p, L and g determine the choice of broadcasting and parallel prefix algorithms in parallel sorting. It is also used in determining the choice of values of certain parameters of the algorithm implementations (such as those related to oversampling issues). This choice of parameter values subsequently affects the maximum input key imbalance among the processors which has a significant impact on running time. Insistence on one-optimality both in the design and analysis of the given algorithms is hoped to lead to efficient implementations that would confirm the claims of the theoretical analysis.
The theoretical analysis of the algorithms can be used in assessing the performance of the implementations because the designed algorithms have been shown to be one-optimal and this way, they have been directly compared to the best sequential implementations. In addition, in the theoretical analysis no hidden constants are involved with the most significant terms of parallel running time and the constants in low order terms are well understood. At the conclusion of the experiments theoretical performance is compared to observed performance on the test platform for this purpose.
In Section 4 we introduce some primitive operations that will be used in the sorting algorithms. In Section 5 we introduce the deterministic algorithm Sort Det BSP that will be implemented. The algorithm is one of several cases of a more general algorithm described in [28] . The performance of the implemented algorithm is however analyzed in detail and more accurately than the general algorithm of [28] , and experiments on its implementation are discussed in Section 6. Its augmentation to handle duplicate keys transparently and with optimal asymptotic efficiency is also discussed separately in Section 5.1.1. Subsequently, a special case of the randomized BSP algorithm in [21] is introduced, called Sort Ran BSP that is a sample-sort based algorithm. The randomized algorithm of our implementations, Sort IRan BSP, derives from Sort Det BSP and Sort Ran BSP and is introduced in the same section; it works differently however from traditional sample-sort based algorithms. Duplicate-key handling is also implemented transparently for Sort IRan BSP using the technique of Section 5.1.1 used for Sort Det BSP. For each of the two implemented algorithms, two variants are studied experimentally depending on how sequential sorting is performed. Finally the obtained experimental results are discussed in detail in Section 6 and compared to other parallel sorting implementations like those in [44, 39, 40, 41] .
The results we obtained justify the belief that architecture independent parallel algorithm design and analysis is possible, plausible, reliable and consistent and can be used to model and predict the performance of parallel algorithms on a variety of parallel machines with satisfactory accuracy. The vehicle for such an architecture independent parallel algorithm design and analysis has been the BSP model of computation. Its parameters seem to model a parallel computer well enough to make parallel program performance estimation plausible for such diverse problems as matrix computations ( [34] ) and parallel sorting. The experiences related to the latter problem are described in more detail in this work.
Primitive Operations
In this section BSP algorithms for two fundamental operations, namely broadcast and parallel-prefix and parallel radix-sort are introduced. All these primitives are auxiliary routines for the algorithms described in later sections. A fully detailed account of such operations along with additional results can be found in [26] .
Lemma 4.1 There exists a BSP algorithm for broadcasting an n-word message that requires time at most M n brd (p) = (⌈n/⌈n/h⌉⌉+h−1) max {L, gt⌈n/h⌉}, for any integer 2 ≤ t ≤ p, where h = ⌈log t ((t − 1)p + 1)⌉− 1.
Proof:
The underlying algorithm employs a pipelined t-ary tree, consisting of p nodes, for some appropriate t. The depth of the tree is h = ⌈log t ((t − 1)p + 1)⌉ − 1. In each superstep, the root processor of a subtree sends t copies of a separate m-word segment of the original message to its children, where m = ⌈n/h⌉. Similarly, each internal processor sends t copies of the message that it received in the previous superstep to its own children. The algorithm completes within at most ⌈n/m⌉ + h − 1 supersteps. Each superstep requires gtm communication time, and the lemma follows. Lemma 4.2 There exists a BSP algorithm for computing n independent parallel-prefix operations that requires time at most C n ppf (p) = 2(⌈n/⌈n/h⌉⌉ + h − 1) max {L, t⌈n/h⌉} and M n ppf (p) = 2(⌈n/⌈n/h⌉⌉ + h − 1) max {L, g2t⌈n/h⌉}, for computation and communication respectively, for any integer 2 ≤ t ≤ p, where h = ⌈log t p⌉, for a total time of T n ppf (p) = C n ppf (p) + M n ppf (p).
The underlying algorithm consists of two passes of the algorithm implied by Lemma 4.1, on a pipelined t-ary tree, consisting of p leaf nodes and at most p internal nodes, for some appropriate t. The depth of the tree is h = ⌈log t p⌉. The lemma follows by way of arguments similar to that of Lemma 4.1.
The same notational convention applies to all pipelined operations.
5 The Algorithms 5.1 BSP Deterministic Sorting Algorithm in [22, 28] The deterministic algorithm of the implementations is based on a non-iterative variant of the sorting algorithm of [22, 27, 28] which has been shown to be one-optimal for a satisfactory range of the BSP parameters that includes most currently built parallel machines; for a wider range of these parameters the algorithm is c-optimal, where c ≥ 1 is a small constant. The algorithm is regular-sampling based ( [61] ) but extends regular sampling to regular oversampling and utilizes an efficient partitioning scheme that splits -almost evenly and independently of the input distribution -an arbitrary number of sorted sequences among the processors. In Section 5.1.1 this base algorithm is augmented to handle transparently and in optimal asymptotic efficiency duplicate keys. In our approach duplicate handling does not require doubling of communication and/or computation time that other approaches seem to require [39, 40, 41] . The base algorithm consists of the following phases:
(1) Local Sorting. Each processor, in parallel with all the other processors, sorts its local input sequence of size n/p that resides in its local memory.
(2) Partitioning. Processors cooperate to evenly split the sorted sequences among the processors.
(3) Merging. Each processor, in parallel with all the other processors, merges a small number of subsequences of total size (1 + o(1))(n/p).
The iterative version of the algorithm performs m iterations of phases 2 and 3. In each iteration the data set is partitioned into k buckets of approximately equal size. In the following iteration a similar process of sub-partitioning each of these k buckets into k further buckets is performed, and so on. The maximum number of keys per processor in any of the m iterations can be shown to be (1 + O(1/ω n ))(n/p). By employing multi-way merging [49] the desired sorted sequence can be obtained.
In Figure 1 , function Sort Det BSP(X), implements the sorting operation for m = 1. X denotes the input sequence, n the size of X, p the number of processors, and s is a user defined parameter (oversampling factor) that inversely relates to the maximum possible imbalance of the sequences formed in step (13) of the algorithm. For any sequence X, the subsequence of X residing in processor k is denoted X k .
The implemented version Sort Det BSP as a special case of the more general algorithm reported in [22, 27, 28] can also be viewed as an adaptation of the regular-sampling algorithm of [61] on the BSP model. It differs, however, from the algorithm in [61] in that sample-sorting is performed in parallel and deterministic oversampling is used with the purpose of achieving finer load balancing in communication.
The choices in deterministic oversampling that also affect load-balancing in the subsequent key routing step are based on the quantifiable results obtained from the detailed analysis of [28] ; although oversampling has been used previously in randomized sorting, its usefulness in deterministic sorting was not explored and quantified in full.
form locally a sample Y k = y 1 , . . . , y rp−1 of rp − 1 evenly spaced keys that partition X k into s = rp evenly sized segments and append the maximum of X k to this sequence ; 5. letȲ = Bitonic Sort(Y ) ; 6. form S = s s , . . . , s (p−1)s fromȲ that consistsof p − 1 evenly spaced splitters that partitionȲ into p evenly sized segments ; The proposition and proof below simplify the results shown in a more general context in [22, 27, 28] .
Proposition 5.1 For any n and p ≤ n, and any function ω n of n such that ω n = Ω(1), ω n = O(lg n) and p 2 ω 2 n ≤ n/ lg n, and for n max = (1 + 1/⌈ω n ⌉)n/p + ⌈ω n ⌉p, algorithm Sort Det BSP requires time,
Corollary 5.1 For n, p and ω n as in Proposition 5.1, algorithm Sort Det BSP is such that
Proof: The input is assumed to be evenly but otherwise arbitrarily distributed among the p processors before the beginning of the execution of the algorithm. Moreover, the keys are distinct since in an extreme case, we can always make them so by, for example, appending to them the code for their memory location. We later explain how we handle duplicate keys without doubling (in the worst case) the number of comparisons performed. Parameter ω n determines the desired upper bound in processor key imbalance during the key routing operation. The term 1 + 1/⌈ω n ⌉ is also referred to as bucket expansion in sample-sort based randomized sorting algorithms ( [12] ). In step 3, each processor sorts the keys in its possession. As each processor holds at most ⌈n/p⌉ keys, this step requires time ⌈n/p⌉ lg ⌈n/p⌉. Algorithm Sort Seq is any sequential sorting algorithm of such performance.
Subsequently, each processor selects locally ⌈ω n ⌉p−1 = rp−1 evenly spaced sample keys, that partition its input into ⌈ω n ⌉p evenly sized segments. Additionally, each processor appends to this sorted list the largest key in its input. Let s = ⌈ω n ⌉p = rp be the size of the so identified list. Therefore step 4 requires time O(s).
The p sorted lists, each consisting of s sample keys, are merged; let sequence s 1 , s 2 , . . . , s ps be the result of the merge operation. By assumption, the sequence is evenly distributed among the p processors, i.e., subsequence s is+1 , . . . , s (i+1)s , 0 ≤ i ≤ p − 1, resides in the local memory of the i-th processor. The cost of step 5 is that of parallel sorting by one of Batcher's methods [5] , appropriately modified [49] to handle sorted sequences of size s. The computation and communication time required for this stage is, respectively, 2s(lg 2 p + lg p)/2 and (lg
In step 6, a set of evenly spaced splitters is formed from the sorted sample. A broadcast operation is initiated in step 7, where splitter s is , 1 ≤ i < p, along with its index in the sequence of sample keys is sent to all processors. Lines 6 and 7 require time O(1) and max {L, gO(p)} + M p−1 brd (p) for computation and communication respectively.
In step 9, each processor decides the position of every key it holds with respect to the p − 1 splitters it received in step 7, by way of sequential merging the splitters with the input keys in p − 1 + n/p time or alternately by performing a binary search of the splitters into the sorted keys in time p lg (n/p), and subsequently counts the number of keys that fall into each of the p so identified buckets induced by the p−1 splitters. Subsequently, p independent parallel prefix operations are initiated (one for each subsequence) to determine how to split the keys of each bucket as evenly as possible among the processors using the information collected in the merging operation. The p disjoint parallel prefix operations in step 9 are realized by employing the algorithm of Lemma 4.2 thus resulting in a time bound of p lg (n/p) + T p ppf (p) for step 9.
In step 11, each processor uses the information collected by the parallel prefix operation to perform the routing in such a way that the initial ordering of the keys is preserved (i.e. keys received from processor i are stored before those received from j, i < j, and also the ordering within i and j is also preserved).
Step 11 takes time max {L, gn max }.
In step 12, each processor merges the at most p sorted subsequences that it received in step 11. When this step is executed, each processor, by way of Lemma 5.1 to be shown, possesses at most p = min {p, n max } sorted sequences for a total of at most n max keys, where n max = (1 + 1/⌈ω n ⌉)(n/p) + ⌈ω n ⌉p. The cost of this stage is that of sequential multi-way merging n max keys by some deterministic algorithm [49] , which is n max lg p, as ω 2 n p = O(n/p).
Summing up all the terms for computation and communication and noting the conditions on L and g and assigning a cost of n lg n to the best sequential algorithm for sorting the result follows.
It remains to prove that at the completion of step 9 the input keys are partitioned into (almost) evenly sized subsequences. The main result is summarized in the following lemma.
The maximum number of keys n max per processor in Sort Det BSP is (1 + 1/⌈ω n ⌉)(n/p) + ⌈ω n ⌉p, for any ω n such that ω n = Ω(1) and ω n = O(lg n), provided that ω 2 n p = O(n/p) is also satisfied.
Proof: Although it is not explicitly mentioned in the description of algorithm Sort Det BSP we assume that we initially pad the input so that each processor owns exactly ⌈n/p⌉ keys. At most one key is added to each processor (the maximum key can be such a choice). Before performing the sample selection operation, we also pad the input so that afterwards, all segments have the same number of keys that is, x = ⌈⌈n/p⌉/s⌉. The padding operation requires time at most O(s), which is within the lower order terms of the analysis of Proposition 5.1, and therefore, does not affect the asymptotic complexity of the algorithm. We note that padding operations introduce duplicate keys; a discussion of duplicate handling follows this proof.
Consider an arbitrary splitter s is , where 1 ≤ i < p. There are at least isx keys which are not larger than s is , since there are is segments each of size x whose keys are not larger than s is . Likewise, there are at least (ps − is − p + 1)x keys which are not smaller than s is , since there are ps − is − p + 1 segments each of size x whose keys are not smaller than s is . Thus, by noting that the total number of keys has been increased (by way of padding operations) from n to psx, the number of keys b i that are smaller than s is is bounded as follows.
isx
A similar bound can be obtained for b i+1 . Substituting s = ⌈ω n ⌉p we therefore conclude the following.
The difference n i = b i+1 − b i is independent of i and gives the maximum number of keys per split sequence.
Considering that x ≤ (n + ps)/(ps) and substituting s = ⌈ω n ⌉p, the following bound is derived.
By substituting in the nominator of the previous expression s = ⌈ω n ⌉p, we conclude that the maximum number of keys n max per processor of function Sort Det BSP is bounded above as follows.
The lemma follows.
The particular algorithm as depicted in Figure 1 corresponds to the simplest case of the deterministic algorithm in [28] where an analysis for all possible values of p is presented. In the general algorithm, the number of communication rounds is lg n/ lg (n/p) for any p = n 1−t , with t constant, i.e. it is constant. If, however, t is not constant, then the number of rounds becomes O(lg n/ lg (n/p)) still matching in all cases the lower bound of [36] . We note that in the general case, the implementation of parallel prefix and broadcasting operations depends on p, L, g and the amount of information processed by these operations. This highlights a difference between architecture independent parallel algorithm design and classic parallel algorithm design. For a given choice of problem size n, and machine i.e. for a given tuple (n, p, L, g), the resulting algorithm may differ from that chosen for another tuple (n ′ , p ′ , L ′ , g ′ ). In the case of sorting for example, one algorithm may implement sample sorting sequentially while another one in parallel, one algorithm may implement a parallel prefix or broadcasting operation using a PRAM approach in lg p supersteps while another algorithm may perform the same operations in constant number of supersteps as in Lemma 4.1 or 4.2.
Duplicate-key Handling
Algorithm Sort Det BSP, as described, does not handle duplicate keys. A naive way to handle duplicate keys is by making the keys distinct. This could be achieved by attaching to each key the address of the memory location it is stored in. For data types whose bit or byte length is comparable to the length of the address describing them, such a transformation leads -in most cases -to a doubling of the overall number of comparisons performed and the communication time in the worst case. For more complex data types such as strings of characters the extra cost may be negligible.
An alternative way to handle duplicate keys is the following one that was also used in the implementations and handles duplicate keys in a transparent way that provides asymptotic optimal efficiency and tags only a small fraction of the keys. This seems to be an improvement over other approaches [39, 40, 41] that require a doubling of communication time. Procedure Sort Seq is implemented by means of a stable sequential sorting algorithm. Two tags for each input key are already implicitly available by default, and no extra memory is required to access them. These are the processor identifier that stores a particular input key and the index of the key in the local array that stores it. No additional space is required for the maintenance of this tagging. In our duplicate-key handling method such tags are only used for sample and splitter-related activity.
As sample sorting is a global operation, for sample sorting and splitter distribution only we augment every sample key into a record that includes this additional tag information (array index and processor storing the key). As the additional tagging information affects the sample only, and the sample is o(1) of the input keys, the memory overhead incurred is small, as is the computational overhead. The attached tag information is used in step 4 to form the sample, in step 5 for sample sorting, in steps 6 and 7 for splitter selection and broadcasting, and finally in step 9 as all these steps require distinct keys to achieve stability and load-balance. In step 9 in particular, a binary search operation of a splitter key into the locally sorted keys involves first a comparison of the two keys. If the keys are equal the result of the comparison is resolved by comparing the readily (and implicitly) available identifier of the processor that holds the local key to the processor storing the comparing splitter (available through the tagging in step 4, and the broadcasting of step 7). If the two processor identifiers are equal, then the result of the comparison is determined by comparing the indices of the position in the local array that stores the local key and the splitter being compared.
In addition, the merging operation must also be performed in a stable manner, that is if, the keys at the head of two sorted sequences are equal the one received from processor i is appears before the one received from processor j, i < j in the output sequence of the operation.
The computation and communication overhead of duplicate handling that is described by this method is within the lower order terms of the analysis and therefore, the optimality claims still hold unchanged. The results on key imbalance still hold as well. This same duplicate handling method is also used in the implementation of algorithm Sort IRan BSP.
BSP Randomized Sorting Algorithm in [21]
Randomized BSP sorting was introduced in [21] . An architecture independent analysis of the algorithm in [21] in terms of problem size n and p, L and g shows that it is one-optimal for most cases of interest. The algorithm derives from quicksort, the ideas of [20, 60] and the technique of oversampling [59] . It achieves the claimed efficiency as follows.
(1). The algorithm satisfies the requirement of one-optimality by using oversampling. The oversampling factor is in general ω(lg n) and a practical choice that is being used in the experiments is Θ(lg 2 n).
(2). As the size of the sample is smaller than input size n and parallel sampling is used, sample sorting can be performed either sequentially (by sending the sample to a single processor and sorting locally) or, as noted in [21] , recursively, or by employing a non-optimal but in practice faster parallel algorithm such as Batcher's bitonic or odd-even merge sort as any of the latter two is simpler to implement and incurs low overhead costs for small problem instances.
(3). At every recursive call of classic quicksort an input sequence is split into two subsequences; a naive parallel implementation of quick sort would require in the best case lg n communication rounds
(4). Based on ideas of [20] it then makes sense to split the input sequence into k sets, where k > 2, by choosing k − 1 splitters at every phase. This way the number of communication rounds is reduced to lg p/ lg k. For p = n 1−t as before, by choosing k so that k = p t , the number of communication rounds is then lg n/ lg (n/p). For constant t, this is constant and therefore, communication time is smaller (O(gn/p)) than before. This observation is used in the randomized algorithm of [21] that describes the various cases of interest: (a) p ≤ √ n that is of practical interest, (b) p = n 1−t , where t is constant and the algorithm is still interesting in terms of its practical implications and (c) for all other cases, p can grow as large as p = O(n/ lg 1+a n), for any constant a > 0 and one-optimality can still be maintained. For most practical applications the number m of communication rounds is one (case (a)) or in some extreme cases at most 2 (case (b)). Which of the algorithms in (a) or (b) applies depends on the values of n, p, L and g. Although partitioning and oversampling in the context of sorting are well established techniques, the analysis in [21] summarized in Claim 5.1 below allows one to prove the one-optimality of the sorting algorithm by quantifying precisely the key imbalance among the processors. Let X = x 1 , x 2 , . . . , x N be an ordered sequence of keys indexed such that x i < x i+1 , for all 1 ≤ i ≤ N − 1. The implicit assumption is that keys are unique. Let Y = {y 1 , y 2 , . . . , y ks−1 } be a randomly chosen subset of ks − 1 ≤ N keys of X also indexed such that y i < y i+1 , for all 1 ≤ i ≤ ks − 2, for some positive integers k and s. Having randomly selected set Y , a partitioning of X − Y into k subsets, X 0 , X 1 , . . . , X k−1 takes place. The following result shown in [21] is independent of the distribution of the input keys. −1)) ) .
Then the probability that any one of the X i , for all i, 0 ≤ i ≤ k−1, is of size more than ⌈(1+ε)(N −k+1)/k⌉ is at most n −ρ .
Algorithm Sort Ran BSP describes case (4)(a) of [21] , and this special case is widely referred to as sample-sort in the literature. X denotes the input key sequence, n the size of X, p the number of processors, and s is a user defined parameter (oversampling factor) that inversely relates to the maximum key imbalance of the split sequences formed in step (13) of the algorithm. form locally a set S = s 1 , . . . , s p−1 of p − 1 evenly spaced splitters that partitionȲ into p evenly sized segments ; 7.
Broadcast(S) ; 8. for each processor k , k ∈ {0, . . . , p − 1}, in parallel 9.
do partition X k into p subsets X k 0 , . . . , X k p−1 as induced by the p − 1 splitters of S ; 10.
for all i ∈ {0, . . . , p − 1} 11. Proposition 5.2 For any ω n , n, p such that 2pω 2 n lg p < n/2, p 2 ≤ n, and
Proof: For the sake of completeness we use elements of the proof of [21] to illustrate the performance of the algorithm for the specific choice of splitters k = p − 1.
Step 2 of the algorithm requires parallel time O(L + gsp) per processor. As shown in [21] , processors select uniformly at random processor identifiers 0 . . . p − 1 and send these identifiers to the identified processors. Chernoff bound techniques can be used to show that each processor sends or receives O(s) identifiers for communication time O(L + gs). Processors then select uniformly at random and without replacement a number of keys equal to the number of identifiers received previously, a step that takes O(s) parallel time. Since all sample keys are then communicated to processor 0 this step would require O(L + gsp) time.
Step 5 requires time sp lg (sp) and splitter selection in step 6 takes O(p) time.
The broadcasting in step 7 takes O(L + gp) time as it will take place in one superstep. Each processor k then performs a binary search of its set X k of n/p input keys onto the p − 1 splitters to determine sets X k 0 . . . X k p−1 . This requires at most (n/p)(lg p + 1) comparisons. By claim 5.1, with probability 1 − o(1), each ofX k is of size at most (1 + 1/ω n )n/p. Therefore, in step 11, each processor sends n/p and receives at most (1 + 1/ω n )n/p keys for communication time O(L + g(1 + 1/ω n )n/p) and computation time in step 12 of (1 + 1/ω n )n/p lg (n/p) + o(n/p) for local sorting, as ln (1 + x) ≤ x, x < 1.
The total parallel computation time of Sort Ran BSP is (1 + 1/ω n )n lg n/p + (n/p) + sp lg (sp) + o(n/p) + O(L + p) and communication time is O(gn/p + gps + L). If we compare the parallel algorithm to the best sequential algorithm that requires n lg n comparisons and noting that L = o(n/(p lg 2 p)), the claimed bounds on π and µ are derived.
Sort Ran BSP maintains an oversampling factor 2ω 2 n lg n that is Ω(lg n).
An implementation of Sort Ran BSP of Figure 2 is straightforward except perhaps that of step 9. In step 9, set X k is split into p sets such that the keys of the i-th set are routed to processor i; the sets are determined by a binary search operation of each key into the set of p − 1 splitters. The formation of the p sets in step 9 is equivalent to an integer sort operation with key the result of the binary search operation (i.e. destination processor). Such set formation operation is thus of linear time Dn/p; constant D is significant however since it includes the cost of copying keys in memory so that keys with the same destination processor are stored in contiguous memory locations. For relatively small values of n, constant D has the same growth as lg n, and therefore, asymptotic claims may not be valid. In addition it seems that the sorting operation of step 5 could be done in parallel rather than sequentially.
These observations were taken into consideration in the design of the randomized BSP sorting algorithm of the implementations so that its performance is fully optimized. The resulting algorithm is Sort IRan BSP. It is this algorithm that was implemented rather than sample-sort Sort IRan BSP. Note that Sort IRan BSP differs from most traditional sample-sort randomized parallel sorting algorithms in that it follows the pattern of local sorting, sample and splitter selection, key routing and local merging rather than the traditional one of sample and splitter selection, key routing and local sorting that identifies traditional sample-sort based randomized parallel sorting algorithms. In particular, in the design and implementation of Sort IRan BSP we addressed the first limitation related to step 9 of Sort Ran BSP. To this end we employed ideas from the deterministic algorithm [22, 27] described in Sort Det BSP. In particular, we sorted the input keys before realizing the communication in step (11) and before we performed the sampling operation. This requires the replacement of the sequential sorting algorithm in step (12) of Sort Ran BSP by a multi-way merging algorithm as the received sequences from at most p − 1 processors are already sorted. A binary search in step (9) is not required any more as we can merge the local sorted keys with the sorted splitters or perform a binary search of the splitters into the keys an operation that allows for coarse-grained communication in step (11) . The resulting algorithm looks similar to Sort Det BSP and thus, attains performance at least comparable to that of the corresponding deterministic algorithm. Random sampling, however, gives the programmer more freedom to determine processor imbalance (ω n in the deterministic case is O(lg n); there is no such limitation in the randomized case). For handling duplicate keys we employed the method of the deterministic algorithm implementation described in Section 5.1.1 which effectively tags few keys only (sample keys) and thus uses asymptotically as much memory as the non-duplicate handling variant. Algorithm Sort IRan BSP is outlined in Figure  3 .
Compared to Sort Ran BSP, in Sort IRan BSP local sorting is performed first on a set of n/p keys, not (1 + 1/ω n )n/p. Binary search of the input keys into the splitters can be simplified by merging the two in linear time or performing a binary search of the splitters into the input key set, a less expensive operation; the latter operation was implemented in the code of our experiments. Communication is simpler, as by the initial sorting and the binary-search operation of the splitter keys into the sorted local input keys, each processor is able to communicate a contiguous block of keys to every other processor. Because each processor receives sorted sequences a multi-way merge operation at the conclusion of the algorithm is required for the same asymptotic cost in terms of comparisons performed to the binary search operation of the keys into the splitter that has become unnecessary.
Computation time of Sort IRan BSP is n/p lg (n/p)+(1+1/ω n )n lg p/p+2ω 2 n lg n lg 2 p+O(p lg (n/p)+ ω 2 n lg n lg p) and communication time is (1 + 1/ω n )ng/p + gω 2 n lg n lg 2 p + L lg 2 p/2 + O(L lg p + gω 2 n lg n lg p + pg). The first two terms of computation time are due to local sorting and multi-way merging respectively, begin Sort IRan BSP (X) 1. let s = 2ω 2 n lg n ; 2. for each processor k , k ∈ {0, . . . , p − 1}, in parallel 3.
do Sort Seq(X k ) ; 4.
select uniformly at random a sample Y = y 1 , . . . , y sp−1 of sp − 1 keys ; 5. letȲ = Bitonic Sort(Y ). 6. for each processor k , k ∈ {0, . . . , p − 1}, in parallel 7.
form set S = s 1 , . . . , s p−1 of p − 1 evenly spaced splitters that partitionȲ into p evenly sized segments ; 8.
communicate the splitters to processor 0. 9. Broadcast(S) ; 10. for each processor k , k ∈ {0, . . . , p − 1}, in parallel 11.
do partition X k into p subsequences X and the third term is due to parallel sample sorting. The first term of communication time is due to key routing, the second term is due to parallel sample sorting, and the third term reflects the number of synchronization operations required for parallel (Batcher-based) sample sorting. The terms hidden in the O(·) notation describe contributions of the remaining auxiliary operations. For example, in computation time the first term in O(·) reflects the cost of binary search of the splitters into the local keys, and the second term low order contributions of parallel (Batcher based) sample sorting. Similarly for communication time the first and seconds terms in O(·) reflect lower order contributions in parallel sample sorting and the third term the cost of splitter broadcasting. Therefore for p 2 ≤ n/(ω n lg n) and 2pω 2 n lg p < n/2, and L ≤ 2n/(p lg 2 p), we conclude that π = 1+lg p/(ω n lg n)+2pω 2 n lg 2 p/n+O(1/ω n lg n+ω 3/2 n lg p/ √ n lg n) and
n lg p/ √ n lg n + g/(ω n lg 2 n)).
Proposition 5.3 is then derived.
Proposition 5.3 For any ω n such that 2pω 2 n lg n < n/2, p 2 ≤ n/(ω n lg n) and L ≤ 2n/(p lg 2 p), algorithm
Performance Evaluation
As the BSP model is not just an abstract architectural model, it may also be employed as a programming platform or, indeed, as a kind of a programming paradigm. The underlying concept in the BSP model is the notion of the superstep and the abstraction that non-local communication associated with a superstep takes place between supersteps as a global operation. Thus, a BSP program may be viewed as a succession of supersteps, with the required non-local communication occurring at the end of each superstep. Viewed this way, the BSP model can be realized as a library of functions with architecture independent semantics for process creation, remote data access and bulk synchronization. The Oxford BSP Toolset, BSPlib, implements such a paradigm [43] and provides library support for BSP programming. The library functions are callable from standard imperative languages such as C and Fortran. It should be borne in mind that such support can be made available by other non BSP-specific libraries (e.g. MPI) that support the simple communication and synchronization primitives required for programming under the BSP model. The effort required to learn the basics of BSPlib is minimal. One needs to understand the semantics of no more than 10-15 functions; half of these functions are related to process creation, destruction and identification. The implementations presented in this work were programmed in ANSI C. Only recompilation is required to run the same code on other platforms such as Silicon Graphics Origin 2000 and Power Challenge systems, or an IBM SP2 system. Our implementations are test for scalability and portability on a distributed memory system, a 128-processor Cray T3D. The manufacturersupplied C compiler (cc) is used through the BSPlib front-end, and the source-code is compiled with the -O3 compiler option set. Timing is obtained through the use of the real-time (wall-clock time) clock function bsp time of BSPlib [37] . The timing results obtained are discussed later in this section. The depicted results in the following tables are in general, averages over at least four experiments. In the following discussion we shall require the BSP parameters of the machine configurations test. The CRAY T3D is thus reported to behave as a BSP machine with sets of parameters (16, 130µsec, 0.21µsec/int), (32, 175µsec , 0.26µsec/int), (64, 364µsec, 0.28µsec/int), (128, 762µsec, 0.34µsec/int), for the configuration used in the experiments (data type in communication is a 64-bit integer). Our implementation of quicksort, sorts 1024 × 1024 integer keys in about 3 seconds. This is equivalent to 7 comparisons per microsecond; expressing g in terms of basic computational operations (i.e. comparisons) a value of g of 0.21µsec/int becomes 0.21 × 7 ≈ 1.47 comparisons/int. The BSP approach in modeling the performance and running time characteristics of parallel algorithms and programs is by modeling some abstract features of the underlying parallel platform as expressed through the parameters p, L and g. This parametrization is not as pure in terms of measuring hardware performance as say that of the LogP model. In fact it may be considered as more flexible, accommodating and forgiving. Parallel performance prediction has also attempted to model high-level parallel and sequential operations [17] . The problem with such approaches as also reported in [17] is that computational operation modeling can be affected by other platform characteristics (e.g. caching) thus making the study of scalability issues difficult. The minimalist approach of the BSP model may lead to fewer such problems, in general.
As the objectives of algorithm design on the BSP model are scalability, portability and predictability of performance, the algorithms we suggest may not lead to the best possible implementation on a specific platform. The sequential methods used may not be the best possible. One may also improve performance (sequential and parallel) by directly using specific features and interfaces of each machine (e.g. communication primitives). Our aim is to show that all three objectives can be realized without significant loss of performance in a general, architecture independent way. This is similar to the portability/reusability one obtains by programming in a high-level language as opposed to assembly.
Algorithm Implementation Features
In the implementation of the BSP algorithms, the following choices have been made.
• The implemented parallel algorithms are comparison-based sorting algorithms even though the data type of the input keys is ANSI C int integers, and one of the two methods used for sequential sorting is an integer-specific method, radixsort. We use integer data keys for input as most other experimental studies also sort integer data and we would like to be able to compare our implementations to other ones. In addition, comparisons performed on integers are faster than those performed on strings of arbitrary length or non-trivial structures. This way we make sequential operations as fast as possible for the purpose of showing the parallel (communication) efficiency of our designs and implementations.
• Our algorithms naturally handle duplicate keys efficiently. The approach to handle duplicate keys is the one discussed in Section 5.1.1. We used it for both the deterministic and randomized algorithm implementations. As a result, the introduced algorithms are independent of input distribution. As it was discussed in Section 5.1.1, the memory overhead of handling duplicate keys is negligible, and may triple in the worst case the sample size as it attaches to each sample key an integer processor identifier and an integer array index; as sample size is o(1) of problem size such overhead is tolerable. The overhead of duplicate handling in computation and communication time is in general asymptotically negligible thus affecting only lower order terms of the running time (a 3-6% deterioration in performance was observed in most of the experiments compared to test cases where duplicate key handling was intentionally switched off). The effect of duplicate key handling becomes non-negligible when sorting 1M integers (relatively few keys) on 128 (many) processors. Our approach is in contrast to the methods of [39, 40] that require twice as much communication to accommodate duplicate keys.
• Given the scarcity of experimental study results that can be used for comparison purposes and the fact that experimental sorting studies (e.g. [39, 40] ) with integer input data use radixsort for sequential sorting, our implementations and experimental studies use two different algorithms for sequential sorting. One of the sequential sorting algorithms is a purely comparison-based algorithm, an author-written version of quicksort, and another one is an author-written integer specific version of radixsort. Radixsort was implemented purely for the purpose of comparing our implementations with other comparison-based parallel algorithms (such as the ones in [39, 40, 41] ) that nevertheless use radixsort for sequential sorting of integers.
• Previous experimental results [31] describe generic implementations of our algorithms on the test and other platforms. A generic implementation is one that can generically sort any data type. This is achieved by including in the parameters of the parallel sorting function an additional argument, which is a function called compare, that compares two instances of the data type that is used as input for sorting. Standard C library function qsort is such an example of a generic sorting function. A function like compare returns -1, 0, or +1 depending on whether the former of its arguments is less, equal or greater than the latter one. The performance of a generic sorting algorithm that performs O(n lg n) comparisons to sort n keys is then bound by the O(n lg n) number of function calls to compare. As a consequence, a generic sorting function becomes 4-7 times slower than the corresponding non-generic one as it is also evident in the comparison of [39] ( Table X) that compares a non-generic to a generic implementation. As all other experimental parallel sorting studies use non-generic implementations, we decided to follow their approach in this study. For a non-generic implementation however, one needs to create a new set of functions for each data type used. We therefore decided to implement non-generic functions of the otherwise originally generic code that operate on integers.
• Total sample size over all the processors for the deterministic algorithm is p 2 ⌈ω n ⌉, where ω n = lg lg n. For the randomized algorithm it is 2pω 2 n lg n, where ω 2 n = lg n.
Algorithm implementations
The following BSP sorting algorithms have been implemented on top of BSPlib.
(1) Two variants of the one-optimal deterministic algorithm Sort Det BSP that both operate on ANSI C int data types, with duplicate key handling performed according to the method described in Section 5.1.1. One uses for sequential sorting, quicksort (an author written implementation) and is called [DSQ] , and the other uses radixsort and called [DSR].
(2) Two variants of the one-optimal randomized algorithm Sort IRan BSP, with duplicate key handling performed according to the method described in Section 5. (3) A version of Batcher's bitonic sort [5] has been implemented and is called [BSI] .
[BSI] is used for parallel sample sorting only. As its performance is worse than that of any of the other four implementations in all but very small problem and processor sizes (for such cases, Batcher's algorithm is faster because of its low overhead) we do not compare its performance to our other implementations.
Remark 1. The implementations can handle duplicate keys as this was previously explained.
Remark 2. The implementations are portable and reusable enough to allow any sequential sorting or merging algorithm to be used as the underlying sequential method.
Sorting Benchmarks
We have test our implementations on a variety of input sets. We tried to conform to previously published data sets [39, 40, 41] by including them in this experimental study. For a discussion of the practicality and suitability of this set of sorting benchmarks we refer to [41, 40, 39] . The sorting benchmarks are briefly defined and INT MAX is the maximum integer value plus one accommodated in a 32-bit signed arithmetic data type (e.g., 2 31 ). The definitions below follow those appearing in [41] , except for the worst regular input set that follows the definition in [39, 40] . The format of the tables reporting the results is similar to the ones of [39, 40, 41] . In addition, we decided not to test our implementation on two additional data sets of [39, 40] (1) Uniform [U], the input is uniformly and at random distributed, and is generated by calling a pseudo random number generator, the C standard library function random(), which returns a long (integer) in the range [0, . . . , 2 31 − 1] and processor's i seed is 21 + 1001 · i.
(2) Gaussian [G], the input follows a Gaussian distribution, and is approximated by adding the results of four calls to random() and dividing the sum by four. 
(6) Deterministic Duplicates [DD], the n/2 i input keys of the first p/2 i consecutive processors are all set to lg (n/p i−1 ), and so forth. In processor p − 1, n/(p2 i ) keys, starting from the lower indexed and proceeding to higher indexed, are set to lg n/(p2 i−1 ) and so forth.
(7) Worst Regular [WR], as described in [39] .
Experimental Results
In this section, we report on the performance and relative merits of the implementations of the algorithms of Section 5. The tables below summarize some of the experimental results we had obtained. Timing figures are in general truncated to three or fewer decimal digits. Table 1 shows timing results for algorithm Sort IRan BSP on 64 processors of a Cray T3D. Results for both [RSR] and [RSQ] are presented for all seven benchmark input sets. Size refers to the total size n of the input over all 64 processors.
It is possible to compare our implementation of Sort IRan BSP to the one described and implemented in [41] (Table I , page 215). For problem size 1M, ours is slower by about 3-10%. For the other common problem sizes (4M, 16M, 64M) our implementation is faster by about 3-8%. It is worth noting that for small problem sizes (e.g. 1M) the use of quicksort for sequential sorting improves the parallel performance of our implementation by 1-2%.
Compared to the implementation of the randomized sorting algorithm introduced in [40] , our implementation is slower by 3-15% for problem size 1M, 3-5% for problem size 4M, and about 1-2% for the remaining problem sizes. The slowness of our implementation is mainly attributable to sequential merging which takes 33-39% of the total execution time of any one experiment as opposed to only 25% for [40] . Sequential operations (sorting and merging related) take together 90% of execution time in our implementation and 80% in the implementation of [40] . Although our implementation is more communication efficient, mainly because it uses only one round of data communication, this advantage is wasted by the apparent inefficiency of a sequential operation (merging).
The experimental results obtained for all seven benchmark sets for algorithm Sort Det BSP are depicted in Table 2 . A comparison of our implementation to the one in [41] (Table IV , page 218) indicates that ours is faster by as much as 13-35% for problem size 1M, 15-25% for problem size 4M, 12-15% for 16M, and 7-15% for 64M. This is possibly attributable to the fact that the algorithm in [41] is a direct variant of the one in [61] . Our implementation on the other hand, although it is also based on ideas of [61] , it naturally handles duplicate keys (without any significant increase of communication time), implements deterministic oversampling and determines sample size on the basis of maximum key imbalance at the completion of sorting. Its design and analysis is solely based on the BSP model and seems to perform better in practice exhibiting performance that is within the predictions of the theoretical analysis. The algorithm of [41] was further refined into an oversampling-based algorithm that takes into account duplicate keys and is described in [39] . This latter algorithm also handles duplicate keys by performing twice as much communication. Comparing the performance of that latter algorithm (Table III of [39] ) to our results as obtained in Table 2 we observe that for problem size 1M our implementation is faster by as much as 10-20%, and for problem sizes, 4M, 16M and 64M our algorithm is faster by 8-26%, 3-10% and 2-3%. In our implementation sorting requires one communication round instead of two in [39] . Even though our sequential methods are slower (as exhibited in Table 4 ) than the ones implemented in [39] (Table IX) , the extra communication round of the algorithm in [39] neutralizes such an advantage even though the cost of a communication round is no more than 5-6% of the overall time of the algorithm as opposed to sequential sorting and merging that account for 45-60% and 30-40% of the running time respectively.
[RSR]
[RSQ] . We used these two input distributions as they have also been used in the experimental work of [39, 40, 41] ; in those studies, for a deterministic sorting algorithm, [WR] is used exclusively [39, 41] , whereas for a randomized sorting algorithm both [WR] [41] and [U] [40] are used. A scalability comparison of all these algorithms is depicted in Table 9 later in this work. With respect to Table 3 it is worth noting that for relatively small ratios n/p (e.g. when p = 128 and for moderate problem sizes), sequential sorting of integers using quicksort may yield better performance than using radixsort, as expected. As the variants of our algorithms that use quicksort for sequential sorting are more CPU intensive, efficiencies are higher for such cases than those that use radixsort. For p = 128, [RSR] has the same efficiency 64-65% independent of the input data distribution; for [RSQ] this varies between 78% to 83%. The efficiencies of the deterministic algorithm are lower however; they are 53% and vary between 63% -67%, i.e. they are between 12-15% lower than those of the randomized algorithm. This is due to the fact that for a fixed size sample, random sampling yields more balanced set of keys in the routing and multi-way merging phases of the sorting algorithm. We note that for the deterministic algorithm we chose ω n = lg lg n and for the randomized algorithm ω 2 n = lg n. In all runs of our experiments (of either the deterministic or the randomized algorithm) maximum set imbalance was kept below 15% well within the approximately 20% obtained from the theoretical analysis for the deterministic (1/⌈lg lg n⌉ × 100%) and randomized algorithms (1/ √ lg n × 100%). For problem size n = 2 23 = 8M as used in Table 3 and for p = 128, the conditions on n, p and L in Proposition 5.1 and 5.3 are hardly satisfied. Substituting for n, p, L and g in the expressions for π and µ in Proposition 5.1 and ignoring the low order terms (enclosed in O(.)) we derive a theoretical bound on efficiency of at least 66% for [DSQ] . The observed efficiency was 63% and 67% for [U] and [WR] respectively. We note that the contribution to the theoretical estimation of running time of handling duplicate keys was ignored (such contributions would have been absorbed in the O(.) term which is anyway ignored in the calculation of the theoretical efficiency). Had we disabled the code for handling duplicate keys, for p = 128 running time for [DSQ] on input [U] would be 0.348 (i.e. efficiency 70% instead of 63%) and that for [DSR] on input [U] would be 0.336 (i.e. efficiency 58% instead of 53%). For the randomized algorithm the theoretical prediction of at least 66% was also satisfied in practice (observed efficiency was 78-82% ). We note that the theoretical analysis applies only to the case that sequential sorting is performed by a comparison and exchange algorithm only. Although the execution time of radixsort is independent of the input distribution (if we ignore caching effects), that of quicksort is not. Tables 4 and 5 It is worth observing that routing time of Ph5 scales satisfactorily with the number of processors, an indication that the randomized algorithm is quite scalable. In the implementation, ω n = √ lg n, and therefore in Ph6 each processor is expected to send or receive no more than (1 + 1/ω n )n/p keys. The observed imbalance on any of the experiments reported in Tables 4 and 5 Tables 6 and 7 show the distribution of execution time for sorting 8M and 32M integers by the deterministic algorithm when input is [U]. Percentages of total running time for the execution time of each phase are also shown. With reference to Figure 1 , Ph1 corresponds to the part of the code before line 2, Ph2 (sequential sorting) to the execution of lines 2-3, Ph3 to lines 4-7, Ph4 to lines 8-9, Ph5 (key routing) to lines 10-11, and Ph6 (sequential merging) to lines 12-13 of the code for Sort Det BSP. Communication performance is scalable though slightly worse than those of the randomized algorithm as random oversampling causes more balanced communication.
From Tables 4 and 6 and Tables 5 and 7 , the sequential portion of the code contributes at least 86-93% of the running time. As the two algorithms share most operations except sampling, the performance of the two algorithms in the first few phases is similar. The quality of sampling affects Phases 5 and 6. This is evident for example from the figures in the four tables for n = 8M and p = 128. The randomized algorithm is about 0.033 seconds faster than the deterministic algorithm and this is solely due to key-imbalance and the better balancing properties of the randomized algorithm whose oversampling parameter can vary more widely than that of the deterministic algorithm. Table 8 ), even though the portable communication library of our experiments may be less efficient than the one of [39, 40, 41] . This may be attributable to sampling choices or the way duplicate keys are handled in [39] .
In Table 9 the first four rows compare the performance of [RSR] to the ones in [40, 41] for problem size n = 8M . Performance of the compared algorithms is similar for up to p = 64 processors except for p = 128 where our implementation is slightly worse by about 5% attributable mainly to some inefficiency in sequential merging. The following three rows compare the performance of [DSR] to the deterministic regular sampling algorithms in [39, 41] .
[DSR] is more efficient than the algorithm in [41] through all processor ranges by about 15-30%. Compared to the implementation in [39] the two exhibit about the same performance for up to p = 32 (the disadvantage of our sequential merging is counterbalanced by the two rounds of communication of [40] ), and for p = 64 and p = 128 [DSR] is better by 10% and 30% Table 10 depicts the scalability of the four sorting variants introduced in this work with increasing processor sizes for three moderate inputs sizes. As mentioned earlier for p = 128 and n = 8M , the asymptotic claims for Sort IRan BSP and Sort Det BSP hardly hold, and this is also evidenced by the deteriorating performance of the implementations for n = 4M and n = 1M . The slowdown for n = 1M is also attributable to the effect of the extra code for duplicate handling; disabling of this piece of code results in the elimination of the slowdown observed for n = 1M ; the obtained speedup is however minimal. We note that the experiments reported in Table 10 constitute a separate set of experiments in addition to those performed for the creation of Table 9, Table 1, and Table 2 ; this explains slight differences in running times which for almost all cases affect the third decimal digit of various reported timing entries.
Another work that includes experimental results on deterministic sorting appears in [44] . Table 11 below compares [DSQ] to a direct implementation of the regular sampling algorithm of [61] reported in [44] . The implementation of [44] is similar to the deterministic algorithm of [41] and the performance of the two algorithms is similar and significantly worse than the more refined algorithm Sort Det BSP ([DSQ]) or the one in [39] . In addition, the algorithm in [44] as well as the algorithm in [41] can not handle duplicate keys. 
Conclusion
In this work we have introduced deterministic and randomized algorithms for internal memory sorting that were designed and their performance analyzed in an architecture independent way under the BSP model. The deterministic algorithm [28] is an improvement of the regular sampling algorithm of [61] and uses the technique of deterministic oversampling (used for randomized sorting in [59] ) and in the general case, its theoretical performance is fully analyzed in [28] . Some other of its advantages are its parallel sample-sort that allows p to be much close to n than the algorithm in [61] . From our experimental observations it seems that parallel sample-sort is effective even for small problem sizes, even if other parallel sorting algorithms and implementations do not use it. In addition, the introduced in this work transparent and efficient duplicate handling method of Section 5.1.1 allows the algorithm to handle duplicate keys without doubling the computation load or the communication time that other approaches seem to require [39, 40, 41] . The randomized algorithm although oversample-sort based works differently from other sample-sort based approaches and even improves upon the performance of the first randomized BSP sorting algorithm of [21] ; it achieves this by using ideas derived from the deterministic algorithm. It also handles duplicate keys with insignificant degradation in performance (computation or communication). Given the insistence under the BSP model of one-optimality and the accurate accounting of key imbalance during sorting, it is easy to finetune the work-load balance and communication time by adjusting the oversampling parameter and sample size of any of the two algorithms. The experimental results presented in this work verify the theoretical claims on the efficiency of the introduced algorithms. We have compared our implementations to other parallel sorting implementations on the same or similar platforms and realized that in terms of parallel communication efficiency our algorithms seem to outperform other implementations and overall exhibit comparable if not better performance (e.g. deterministic algorithm) even though the sequential algorithms used in our implementations seem, in many instances, to be slower than those of other implementations. This suggests that one can probably improve overall performance of our implementations if sequential methods are optimized given the fact that in our communication efficient designs sequential code takes 80-90% of execution time for the problem instances examined. As far as communication is concerned there is probably no room for significant improvement unless one tries to optimize the single communication round by utilizing platform specific communication primitives at the expense of portability. The experimental portion of this work was performed while both authors were with the Oxford University Computing Laboratory, The University of Oxford, Oxford, UK. The work of the first author there was supported in part by EPSRC under grant GR/K16999 and that of the second author was supported in part by a Bodossaki Foundation Graduate Scholarship. The support of the Edinburgh Parallel Computing Centre in granting the first author access to a Cray T3D computer is gratefully acknowledged. The work of the first author was subsequently supported in part by NJIT SBR grant 421350 and NSF grant NSF-9977508.
