Two new parallel integer sorting algorithms, queue-sort and barrel-sort, are presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show very good performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine CM-2 and barrel-sort on a 128 processor iPSC/860 are given. The two implementations are found to be comparable in performance but not as good as a fully vectorized bucket sort on the Cray YMP.
the lower-bound $O(n)$ time for sequential integer sorting. On a parallel machine, the performance bounds are limited by processors as well as time. Therefore the performance bound of parallel algorithms must be measured as the product of the processor bound, $P$, and the time bound, $T$. A parallel algorithm is optimal if its performance bound $PT$ is equal to the sequential time bound, $T_s$, for the problem. Several optimal parallel integer sorting algorithms have been proposed (see [14, 9]). However, these algorithms have proved unsuitable for implementation on single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) distributed memory machines like the Connection Machine CM-2 or the Intel iPSC/860. This paper presents two parallel integer sorting algorithms which, although not optimal, have been implemented and shown to give good performance on these machines. Some theoretical analysis of these algorithms is presented; however, the algorithms of this paper were born of an applications-oriented perspective, and emphasis is given to the application analysis.
The remainder of this section defines key terms used throughout the paper. Section 2 overviews the two machine models used. Sections 3 and 4 introduce and analyze the two sorting schemes, namely queue-sort and barrel-sort, respectively. Performance results are compared and discussed in section 5 and conclusions are presented in section 6.
Some Definitions.
A sequence of keys, $\{K_i \mid i = 0, 1, \ldots, N-1\}$, will be said to be sorted if it is arranged in non-descending order, i.e. $K_i \le K_{i+1} \le K_{i+2} \le \ldots$. The rank of a particular key in a sequence is the index value $i$ that the key would have if the sequence of keys were sorted. Ranking, then, is the process of arriving at a rank for all the keys in a sequence. Sorting is the process of permuting the keys in a sequence to produce a sorted sequence. If an initially unsorted sequence, $K_0, K_1, \ldots, K_{N-1}$, has ranks $r(0), r(1), \ldots, r(N-1)$, the sequence becomes sorted when it is rearranged in the order $K_{r(0)}, K_{r(1)}, \ldots, K_{r(N-1)}$. Sorting is said to be stable if equal keys retain their original relative order. In other words, a sort is stable only if $r(i) < r(j)$ whenever $K_{r(i)} = K_{r(j)}$ and $i < j$. The algorithms presented here are not stable. Key density refers to the number of equal keys in a sequence. The prefix sum of a sequence is the sequence obtained as the running sum of the original sequence elements. The $j$th element of the prefix sum of the sequence $\{K_i\}$ is given by $\sum_{i=0}^{j} K_i$. Prefix operations are also referred to as scan operations [5]. A scan operation with binary operator $\oplus$ across an ordered set $[a_0, a_1, \ldots, a_{n-1}]$ returns the ordered set $[a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})]$. All logarithms are to base 2 unless otherwise indicated.
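The definitions of ranking and scanning can be illustrated with a short sequential sketch in Python. It is only an illustration under stated assumptions: the function names `ranks` and `scan` are hypothetical, the rank of a key is taken to be the position it would occupy in the sorted sequence, and ties are broken by original position (a choice made here, not one the paper's algorithms guarantee).

```python
from itertools import accumulate
import operator


def ranks(keys):
    """Return r where r[i] is the position key i would occupy once sorted."""
    # sorted() is stable, so equal keys keep their original relative order.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    r = [0] * len(keys)
    for position, i in enumerate(order):
        r[i] = position
    return r


def scan(values, op=operator.add):
    """Inclusive scan: [a0, a0 op a1, ..., a0 op a1 op ... op a(n-1)]."""
    return list(accumulate(values, op))


keys = [3, 1, 2, 1]
r = ranks(keys)

# Placing each key at its rank produces the sorted sequence.
sorted_keys = [None] * len(keys)
for i, key in enumerate(keys):
    sorted_keys[r[i]] = key

prefix_sum = scan([1, 2, 3, 4])   # running sum
running_max = scan(keys, max)     # scan with a different binary operator
```

With the default operator, `scan` returns the prefix sum; passing any other associative binary operator gives the corresponding scan.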
2. Machine Models. The algorithms presented here were implemented on two different parallel machines at NASA Ames, the Thinking Machines Connection Machine CM-2 and the Intel iPSC/860. The architectures are briefly described below.
2.1. Connection Machine. The CM-2 is a massively parallel SIMD computer consisting of many thousands of bit serial data processors under the direction of a front end computer. The system at NASA Ames consists of 32768 bit serial processors, each with 1 Mbit of memory and operating at 7 MHz. The processors and memory are packaged 16 to a chip. Each chip also contains the routing circuitry which allows any processor to send and receive messages from any other processor in the system. In addition, there are 1024 64-bit Weitek floating point processors which are fed from the bit serial processors through a special purpose "Sprint" chip. The Connection Machine CM-2 can be viewed two ways, either as an 11-dimensional hypercube connecting the 2048 CM chips or as a 10-dimensional hypercube connecting the 1024 processing elements. The first view is the "fieldwise" model of the machine which has existed since its introduction. This view admits to the existence of at least 32768 physical processors (when using the whole machine), each storing data in fields within its local memory. The second is the more recent "slicewise" model of the machine which admits to only 1024 processing elements (when using the whole machine), each storing data in slices of 32 bits distributed across the 32 processors in the processing element. Both models allow for "virtual processing", where the resources of a single data processor may be divided to allow a greater number of virtual processors.
Regardless of the machine model, the architecture allows interprocessor communication to proceed in three manners. For very general communication with no regular pattern, the router determines the destination of messages at run time and directs the messages accordingly. This is referred to as general router communication. For communication with an irregular but static pattern, the message paths may be pre-compiled and the router will direct messages according to the pre-compiled paths. This is referred to as compiled communication and can be 5 times faster than general router communication. Finally, for communication which is perfectly regular and involves only shifts along grid axes, the system software optimizes the data layout by ensuring strictly nearest neighbor communication and uses its own pre-compiled paths. This is referred to as NEWS (for "North-East-West-South") communication. Despite the name, NEWS communication is not restricted to 2-dimensional grids and up to 31-dimensional NEWS grids may be specified. NEWS communication is the fastest.
The Connection Machine's processors are used only to store data. The program instructions are stored on a front end computer which also carries out any scalar computations. Instructions are sequenced from the front end to the CM through one or more sequencers. Each sequencer broadcasts instructions to 8192 processors and can execute either independently of other sequencers or combined in groups of two or four.
The complete iPSC/860 system is controlled by a system resource module (SRM), which is based on an Intel 80386 processor. This system handles compilation and linking of source programs, as well as loading the executable code into the hypercube nodes and initiating execution. Programs generally make no use of the SRM once they begin execution on the nodes.
3. Fine-Scale Parallel Integer Sort. The fine-scale parallel integer sorting algorithm is similar to that described in [10]; however, it makes use of the send to queue instruction [16] on the Connection Machine CM-2. This is a very powerful instruction that takes multiple messages for the same destination and stores them in a queue at the receiving processor. Each processor must have the same size buffer allocated to store the queue. This restriction is due to the SIMD nature of the Connection Machine, which employs a single stack pointer for processor memories, so it is impossible to allocate variable amounts of memory across processors. The allocated buffer must also include a word in which to store the number of elements destined for the queue. If the buffer can store $q_s$ messages, and more than $q_s$ messages are sent to a particular processor, then the excess messages are lost but this word will still store the number of messages intended for that destination.
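The queueing semantics described above can be emulated sequentially. The following Python sketch is not the CM-2 instruction itself; the function name `send_to_queue`, its parameters, and the assumption that messages arrive in sender order are all hypothetical and chosen only to show the fixed-size buffer and the count word.

```python
def send_to_queue(dest, payload, num_dest, q_s):
    """Emulate delivery of payload[i] to destination dest[i].

    Returns (queues, counts): queues[d] holds at most q_s messages, while
    counts[d] records how many messages were addressed to d, including
    those that were lost because the buffer was already full.
    """
    queues = [[] for _ in range(num_dest)]
    counts = [0] * num_dest
    for d, message in zip(dest, payload):
        counts[d] += 1                    # the count word always advances
        if len(queues[d]) < q_s:
            queues[d].append(message)     # room left in the fixed-size buffer
        # otherwise the message is dropped; counts[d] still reflects it
    return queues, counts


# Five senders, two destinations, room for two messages per queue.
queues, counts = send_to_queue([0, 1, 0, 0, 1], [10, 11, 12, 13, 14],
                               num_dest=2, q_s=2)
```

In this example destination 0 is sent three messages but keeps only two, and its count still reads three, which is what lets a receiver detect that another iteration is needed.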
3.1. Fine-Scale Parallel "Queue-Sort" Algorithm. The $n$ keys are stored in a one-dimensional virtual processor (VP) set, call it VP1, of size $n$. Each VP has an index $i$ and stores key $K_i$. The keys have range $[1, m]$, where $m$ is no greater than $O(n)$; therefore $m$ buckets are needed to sort them. The main idea behind the algorithm is to create a queue for each bucket, perform a prefix sum over queue elements to compute the rank, and return the rank. The algorithm must be iterated when there are key densities greater than the maximum queue size. The steps in a single iteration of the queue-sort algorithm are as follows:
Queue-Sort Algorithm
1. In a distinct VP set, call it VP2, allocate memory in $m$ virtual processors for a queue of size $q_s$. The value of $q_s$ will depend on the available memory; in the analysis below we assume $m q_s = O(n)$.
2. Each processor in VP1 computes a destination address in VP2 based on the value of its key. The $n$ processors of VP1 then collectively send their self-address to this destination using send to queue.
3.2. Theoretical Analysis. Blelloch [5] describes a "scan-model" of computation for the Connection Machine (that is, the Exclusive Read Exclusive Write (EREW) model but including prefix or "scan" operations as unit-time primitives). This model is assumed in the following analysis.
The performance of this algorithm depends on the key density distribution. If $\rho_{\max}$ is the maximum key density, then $\lceil \rho_{\max}/q_s \rceil$ iterations are required to complete the sort. Recall that $q_s$ is fixed by the available memory. Assume that $O(n)$ words of memory are available for $m$ queues, such that $q_s = O(n)/m$. In the following we will allow $m$ to be any number less than but evenly divisible into $n$, yet greater than or equal to the number of physical processors, $N_p$. For example, $m$ can be $n/\log^2 n$, $n/\log n$, or $n$ for $n = 2^{32}$. Therefore $q_s$ will have size $O(n/m)$. Clearly then, steps 1 and 5 will take $O(1)$ time and step 4 will take $O(n/m)$ time, each with $O(m)$ processors. Communication is required in steps 2, 3 and 6; these require special consideration. In the scan model of the CM-2, step 3 requires $O(1)$ time to complete using $O(m)$ processors. The time for step 2 will depend on the number of combinations required to complete the send to queue. As is shown in the application analysis below, this is essentially given by the ratio $n/m$, so step 2 has time complexity $O(n/m)$.
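The dependence of the iteration count on the maximum key density and the queue size can be checked with a small sequential emulation. This Python sketch follows only the main idea stated in Section 3.1 (one bounded queue per bucket, a prefix sum over bucket counts, iteration while unranked keys remain); how the work is divided among the individual steps of the algorithm is an assumption here, and the name `queue_sort_ranks` is hypothetical.

```python
from itertools import accumulate
from math import ceil


def queue_sort_ranks(keys, m, q_s):
    """Rank keys in [1, m] using one queue of capacity q_s per bucket.

    Returns (rank, iterations). Sequential emulation only; the resulting
    order among equal keys is whatever the queue order happens to be.
    """
    rank = [None] * len(keys)
    assigned = [0] * m            # keys already ranked in each bucket
    base = None                   # rank of the first key in each bucket
    iterations = 0
    while any(r is None for r in rank):
        iterations += 1
        queues = [[] for _ in range(m)]
        counts = [0] * m
        for i, key in enumerate(keys):
            if rank[i] is None:               # only unranked keys send
                b = key - 1
                counts[b] += 1                # count word includes lost sends
                if len(queues[b]) < q_s:
                    queues[b].append(i)       # self-address joins the queue
        if base is None:
            # First pass sees every key: exclusive prefix sum of the counts.
            base = [0] + list(accumulate(counts))[:-1]
        for b, queue in enumerate(queues):
            for j, i in enumerate(queue):
                rank[i] = base[b] + assigned[b] + j
            assigned[b] += len(queue)
    return rank, iterations


keys = [2, 1, 2, 2, 3, 1]         # maximum key density is 3 (the key 2)
rank, iterations = queue_sort_ranks(keys, m=3, q_s=2)
expected_iterations = ceil(3 / 2)
```

Here the key 2 occurs three times and each queue holds two entries, so a second iteration is needed to rank the remaining key.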
Step 6 must be carried out iteratively using at most $q_s$ subiterations.
Step 3 gets executed just once and has negligible effect on the overall execution time. Therefore only steps 2 and 6 need be considered for analysis. Since network contention is reduced as fewer messages are transmitted, and since each successive iteration requires fewer messages to transmit, the time required by steps 2 and 6 decreases as the calculation proceeds. In the following, models are developed to account for the effect of network contention on communication performance. A sample of the key density distribution is shown in figure 1. The maximum key density is actually 73, but the figure only shows key densities for every 128th key value and the maximum is missed. There are 417,812 different keys. As expected, steps 2 and 6 took the majority (97%) of the time. Figure 2 presents the time to sort, using 8k processors, as a function of the queue size. Note that the time spent in step 6 is independent of the queue size. Changing the queue size changes the number of iterations necessary for queue-sort to complete. However, the total number of subiterations taken in step 6 always must equal the maximum key density, $\rho_{\max}$; for this reason its time is unaffected by the size of $q_s$. On the other hand, the time spent in step 2 is affected by the size of $q_s$. As $q_s$ increases, fewer iterations are required, so the overhead in using send to queue is paid fewer times. However, even when the number of iterations is constant, the time spent in send to queue decreases with increasing $q_s$. This implies that send to queue behaves in a manner similar to a conventional send in that the communication time is determined by the network bandwidth. The queueing of messages occurs in the network, so network contention has a great impact on the performance of send to queue. In the first iteration, all processors in VP1 are sending messages to VP2; therefore the time required by send to queue is constant regardless of $q_s$.
However, in the second iteration, the number of active processors in VP1 decreases as $q_s$ increases (because fewer keys remain to be ranked). Therefore the communication time decreases because of reduced network contention. Figure 3 presents the time accumulated by each send to queue instruction as $q_s$ increases. It is evident from this figure that reducing network contention is more important than decreasing the communication startup cost in terms of improving performance. The issue of network contention is discussed more fully below. Figure 4 presents the fraction of active processors in VP2 per subiteration of step 6. The values have been normalized by the total number of processors in the VP set. From this curve one can determine the number of messages communicated in a particular subiteration of step 6. Network contention affects step 6 in the same manner as step 2. As there are fewer keys remaining to be ranked, network traffic decreases and the communication time for each subiteration of step 6 decreases.
The solid curve in figure 5 presents the time spent per subiteration of .
A sample of the key density distribution is shown in figure 7; the maximum key density was 58 and there were 507,810 different keys. Figure 8 presents the number of active processors in VP2 per subiteration of step 6. Using figure 8, the model for $T_s$ predicts a time of 8.1 sec to complete step 6. The measured time was 7.7 sec, which is within 5% of our predicted time. Figure 9 presents the measured and predicted times for $T_q$ as a function of $R_{act}$. The agreement is very close, especially for large values of $R_{act}$, which, of course, is how the model was calibrated. Finally, figure 10 presents the measured and predicted times for queue-sort for the density distribution of figure 7. Again there is excellent agreement, with the model being accurate to 5% of the measured times.
It should be obvious from these results that the effect of decreasing network contention on communication performance must be considered when analyzing communication-iterative algorithms like queue-sort. The success of these models in predicting communication performance as a function of network contention is extremely encouraging and indicates that the performance of complex and changing communication patterns can be predicted with some accuracy using relatively simple models for communication.
4. Each processor determines which subrange $[a_k, b_k)$ contains each of its keys and stores the result in a local array $P_i$. Each value in $P_i$ is a pointer to the processor whose assigned subrange of buckets contains key value $K_i$. This step requires a binary search in $\{[a_k, b_k) \mid k \in [1, p]\}$.
5. Each processor sends to all other processors the keys in that other processor's subrange. This is carried out by having each processor rank its list of keys according to $P_i$ and permute the key and index sequences, $\{K_i\}$ and $\{i\}$, accordingly. Let $\{K_q\}$ be the sorted $\{K_i\}$ and let $\{I_q\}$ be the index in $\{K_i\}$ for $\{K_q\}$. Note that $P_i$ has range $[1, p]$ and sorting is carried out strictly locally on each processor. Each processor then sends the appropriate subsequence of $\{K_q\}$ and $\{I_q\}$ to the corresponding processor. This is an all-to-all (or complete exchange) type of communication with message lengths of varying sizes which permutes $\{K_q\}$ into a new sequence $\{K_r\}$. At the end of this step, every processor $k$ stores a subsequence of $\{K_r\}$, approximately of length $n/p$, and the corresponding subsequence of $\{I_r\}$, where $\{I_r\}$ are the indices in $\{K_i\}$ for the keys in $\{K_r\}$. Furthermore, each subsequence of $\{K_r\}$ has range $[a_k, b_k)$.
6. Each processor ranks its subsequence of keys and permutes its subsequence of $\{I_r\}$ accordingly. Let $\{I_s\}$ be this permuted sequence. Permuting $\{K_r\}$ at this point would result in a sorted sequence of keys. However, the objective is not to sort the original sequence of keys but rather to find the permutation which sorts it (under the assumption that the records associated with the keys are large and one wants to permute them just once). Nonetheless, assume such a permutation was carried out; then $\{K_s\}$ would be the sorted sequence of $\{K_i\}$, and $\{I_s\}$ would be the index in $\{K_i\}$ for $\{K_s\}$. Therefore the permutation, $\{R_i\}$, which converts $\{K_i\}$ into $\{K_s\}$ is
where $k$ is the number of bytes in the message, $t$ is the latency, and $t_{send}$ is the time per byte. Using the numbers from [8], for long messages (greater than 100 bytes), $t = 149$ $\mu$sec and $t_{send} = 0.36$ $\mu$sec, and for short messages $t^{sh} = 74$ $\mu$sec and $t^{sh}_{send} = 0.19$ $\mu$sec.
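Steps 4 to 6 of barrel-sort can be emulated sequentially to show how the index sequence $\{I_s\}$ arises. The Python sketch below is an illustration under several assumptions: the subrange lower bounds $a_k$ are supplied as input (the steps that compute them are not reproduced), processors are numbered from 0, the complete exchange is replaced by appending to per-processor lists, and Python's built-in sort stands in for the local ranking. The name `barrel_sort_permutation` is hypothetical.

```python
from bisect import bisect_right


def barrel_sort_permutation(local_keys, lower_bounds):
    """Return global indices of all keys, listed in sorted key order.

    local_keys[k]   : keys initially held by processor k
    lower_bounds[k] : a_k, ascending; processor k owns [a_k, a_(k+1))
    Every key is assumed to be >= lower_bounds[0].
    """
    p = len(local_keys)
    # Global index of the first key on each processor.
    offsets = [0] * p
    for k in range(1, p):
        offsets[k] = offsets[k - 1] + len(local_keys[k - 1])

    inbox = [[] for _ in range(p)]
    for src in range(p):
        # Step 4: binary search for the processor owning each key's subrange.
        dest = [bisect_right(lower_bounds, key) - 1 for key in local_keys[src]]
        # Step 5: order the local keys by destination, then "send" each
        # (key, original index) pair to its destination processor.
        order = sorted(range(len(dest)), key=lambda j: dest[j])
        for j in order:
            inbox[dest[j]].append((local_keys[src][j], offsets[src] + j))

    # Step 6: each processor ranks what it received and permutes the indices.
    permutation = []
    for k in range(p):
        inbox[k].sort(key=lambda pair: pair[0])
        permutation.extend(index for _, index in inbox[k])
    return permutation


local_keys = [[5, 1, 9], [3, 8, 2]]      # two emulated processors
perm = barrel_sort_permutation(local_keys, lower_bounds=[1, 5])
flat = [key for chunk in local_keys for key in chunk]
in_order = [flat[i] for i in perm]
```

Gathering the original keys through the returned indices yields the sorted sequence, while the records themselves would only need to be moved once.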
Step 2 can be implemented using a pairwise exchange algorithm such that only $\log p$ messages are required per processor, each of length $4m'$ bytes. Each complete exchange in steps 5 and 6 is implemented as $3(p-1)$
5. Discussion. Table 1 presents the best results for queue-sort and barrel-sort on the Connection Machine CM-2 and the iPSC/860 respectively. Times are given for both the Gaussian distributed and the linear distributed key densities (see figures 1 and 7). The maximum queue size in queue-sort was made large enough to allow completion in a single iteration. In a real application, the memory available for the queues would be limited to that which was unused by the rest of the application; the resulting impact on performance may be deduced from figure 2. The number of barrels used in barrel-sort was 2048. The current implementation of barrel-sort could not be run with fewer than 64 processors because of memory restrictions, although with some modifications 32 processor results could be obtained. For comparison, the performance of a vectorized bucket sort on 1 and 8 processors of the Cray YMP is also given. The Cray YMP code was obtained from Cray Research Inc. in response to the NAS Parallel Benchmarks and is highly tuned to this architecture. It is encouraging to see that the performance of queue-sort and barrel-sort is comparable. Queue-sort involves virtually no arithmetic computation but depends on many single-word transmissions to order the data. The total amount of data motion is about the same for both queue-sort and barrel-sort. Queue-sort essentially consists of $n$ single-word transmissions using $n$ processors followed by $n$ single-word transmissions using $n/m$ processors. Therefore for queue-sort to be competitive the overhead on message transmission must be very low. On the other hand, barrel-sort consists of two complete exchanges each involving $p-1$ transmissions of approximately $n/p^2$ words using $p$ processors.
Barrel-sort attempts to minimize the number of messages transmitted at the expense of additional arithmetic computation. Therefore barrel-sort should perform well on machines with a high overhead on message transmission so long as medium-scale parallelism is available.
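The benefit of few long messages on a machine with high per-message overhead can be made concrete with a rough estimate. The Python sketch below assumes a linear cost model (latency plus bytes times time per byte), uses the iPSC/860 constants quoted from [8], assumes 4-byte words, and ignores network contention and the details of how a complete exchange is actually scheduled. The function names and the problem size $n = 2^{23}$ are hypothetical, chosen only for illustration.

```python
# Assumed linear cost model: time = latency + bytes * time_per_byte.
LONG = (149.0, 0.36)     # (latency, per-byte time) in microseconds, > 100 bytes
SHORT = (74.0, 0.19)     # same quantities for short messages


def message_time_us(k_bytes):
    """Estimated time in microseconds to send one message of k_bytes."""
    latency, per_byte = LONG if k_bytes > 100 else SHORT
    return latency + per_byte * k_bytes


def exchange_time_us(n, p, bytes_per_word=4):
    """One complete exchange: p - 1 messages of about n / p^2 words each."""
    words_per_message = n // (p * p)
    return (p - 1) * message_time_us(bytes_per_word * words_per_message)


def single_word_time_us(n, p, bytes_per_word=4):
    """The same words sent from one processor, one word per message."""
    words = (p - 1) * (n // (p * p))
    return words * message_time_us(bytes_per_word)


n, p = 2**23, 128        # hypothetical problem size and processor count
long_messages = exchange_time_us(n, p)
short_messages = single_word_time_us(n, p)
print(f"long messages:  {long_messages / 1e6:.3f} s per processor")
print(f"single words:   {short_messages / 1e6:.3f} s per processor")
```

Under these assumptions the per-message latency dominates the single-word variant, which is consistent with barrel-sort trading extra local computation for fewer, longer messages.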
The Cray YMP performance taken from [4] is given for comparison. Neither queue-sort on the CM-2 nor barrel-sort on the iPSC/860 can match the performance of a vectorized bucket-sort on the YMP. In general, sorting requires a very high memory bandwidth and relatively little computation. The high memory bandwidth is a well known feature of the Cray machines and one expects good performance on the YMP for this problem. These parallel architectures have exhibited comparatively good performance for this problem with highly tuned radix sorts (see [2, 4, 6]); however, the results presented in this paper are intended primarily for understanding the algorithms, not benchmarking the machines. Benchmark results for this problem may be found in [4].
6. Conclusions. Two new parallel integer sorting algorithms, queue-sort and barrel-sort, have been presented and analyzed in detail. These algorithms do not have optimal parallel complexity, yet they show respectable performance in practice. Queue-sort is designed for fine-scale parallel architectures which allow the queueing of multiple messages to the same destination. Barrel-sort is designed for medium-scale parallel architectures with a high message passing overhead. The performance results from the implementation of queue-sort on a Connection Machine and barrel-sort on a 128 processor iPSC/860 are presented and compared to a fully vectorized bucket sort on the Cray YMP. The parallel machines show poorer performance because of their comparatively slower memory systems.
