Abstract. Merge sort is useful for progressively sorting large amounts of data, especially when the data can be partitioned and easily collected onto a few processors. Merge sort can be parallelized; however, conventional algorithms on distributed memory computers perform poorly because the number of participating processors is halved at each step, down to one in the last merging stage. This paper presents a load-balanced parallel merge sort in which all processors take part in merging throughout the computation. Data are evenly distributed to all processors, and every processor is forced to work in all merging phases. An analysis shows the upper bound of the speedup of the merge time as (P − 1)/log P, where P is the number of processors. We have reached a speedup of 8.2 (upper bound is 10.5) on a 32-processor Cray T3E in sorting 4M 32-bit integers.
Introduction
Many comparison-based sequential sorts take O(N log N) time to sort N keys. To speed up sorting, multiprocessors are employed for parallel sorting. Several parallel sorting algorithms such as bitonic sort [1, 6], sample sort [5], column sort [3] and parallel radix sort [7, 8] have been devised. Parallel sorts usually need a fixed number of data exchange and merging operations. The computation time decreases as the number of processors grows. Since the time depends on the number of keys each processor has to process, good load balancing is important. In addition, if interprocessor communication is not fast, as in distributed memory computers, the overall amount of data to be exchanged and the frequency of communication have a great impact on the total execution time.
Merge sort is frequently used in many applications. Parallel merge sort on the PRAM model was reported to have a fast execution time of O(log N) for N input keys using N processors [2]. However, parallel merge sort on distributed memory is slow because it needs a local sort followed by a fixed number of merge iterations that include lengthy communication. The major drawback of the conventional parallel merge sort is that load balancing and processor utilization get worse as it iterates. In the beginning every processor participates in merging its list of N/P keys with its partner's, producing a sorted list of 2N/P keys, where N and P are the number of keys and processors, respectively. From the next step on, only half of the processors used in the previous stage participate in the merging process. This results in low resource utilization and consequently lengthens the computing time. This paper introduces a new parallel merge sort scheme, called load-balanced parallel merge sort, that forces every processor to participate in merging at every iteration. Each processor deals with a list of about N/P keys at every iteration, so the load of the processors is kept balanced to reduce the execution time.
The paper is organized as follows. In Section 2 we present the conventional and improved parallel merge sort algorithms together with the idea of how more parallelism is obtained. Section 3 reports experimental results obtained on a Cray T3E and a PC cluster. The conclusion is given in the last section, followed by a performance analysis in the appendix.
Parallel Merge Sort

Simple Method
Parallel merge sort goes through two phases: a local sort phase and a merge phase. The local sort phase leaves the keys in each processor sorted locally; in the merge phase the processors then merge them in log P steps as explained below. In the first step, processors are paired as (sender, receiver). Each sender sends its list of N/P keys to its partner (receiver), then the two lists are merged by each receiver to make a sorted list of 2^1 N/P keys. Half of the processors work during the merge, and the other half sit idle. In the next step only the receivers of the previous step are paired as (sender, receiver), and the same communication and merge operations are performed by each pair to form a list of 2^2 N/P keys. The process continues until a complete sorted list of N keys is obtained. The detailed algorithm is given in Algorithm 1.
As mentioned earlier, the algorithm does not fully utilize all processors. A simple calculation reveals that only about P/log P processors are used on average: (P/2 + P/4 + P/8 + · · · + 1)/(log P steps) = (P − 1)/log P. Its performance must therefore be inferior to that of an algorithm that makes full use of all processors, if one exists.
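The simple scheme and its utilization figure can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: processors are modeled as Python lists, communication is not modeled, P is a power of two, and the function name `simple_parallel_merge_sort` is hypothetical rather than taken from the paper.

```python
import heapq
import math
import random


def simple_parallel_merge_sort(keys, num_procs):
    """Simulate the simple scheme sequentially; return (sorted_keys, active_counts)."""
    assert num_procs & (num_procs - 1) == 0, "P is assumed to be a power of two"
    chunk = math.ceil(len(keys) / num_procs)
    # Local sort phase: processor i holds a sorted list of about N/P keys.
    lists = [sorted(keys[i * chunk:(i + 1) * chunk]) for i in range(num_procs)]
    active_counts = []  # number of processors that merge in each step
    while len(lists) > 1:
        merged = []
        for r in range(0, len(lists), 2):
            # The receiver (even slot) merges its list with the sender's list.
            merged.append(list(heapq.merge(lists[r], lists[r + 1])))
        active_counts.append(len(merged))
        lists = merged
    return lists[0], active_counts


random.seed(0)
data = [random.randrange(1 << 32) for _ in range(1 << 12)]
P = 8
result, active = simple_parallel_merge_sort(data, P)
average_active = sum(active) / len(active)  # equals (P - 1) / log2(P)
```

With P = 8 the merging processors per step are 4, 2 and 1, so the average is 7/3, which matches (P − 1)/log P.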
Load-Balanced Parallel Merge Sort
Keeping each list of sorted data in one processor is simple and easy to deal with as far as the algorithm is concerned. However, as the size of the lists grows, sending them to other processors for merging is time consuming, and processors that no longer keep lists after transmission sit idle until the end of the sort. The key idea in our parallel sort is to distribute each (partially) sorted list onto multiple processors such that each processor stores an approximately equal number of keys, and all processors take part in merging throughout the execution.
Algorithm 1: Simple parallel merge sort
P: the total number of processors (assume P = 2^k for simplicity)
Pi: a processor with index i
h: the number of active processors
1. Pi sorts a list of N/P keys locally.
Figure 1 illustrates the idea of the merging with 8 processors, where each rectangle represents a list of sorted keys, and processors are shown in the order in which they store and merge the corresponding list. This exposes more parallelism and thus shortens the sort time. One difficulty with this method is finding a way to merge two lists, each of which is distributed over multiple processors rather than stored in a single processor. Our design, described below, minimizes key movement.
A group is a set of processors that are in charge of one sorted list. Each group stores a sorted list of keys by distributing them evenly to all of its processors. It also computes the histogram of its own keys. The histogram plays an important role in determining the minimum number of keys to be exchanged with others during the merge. Processors keep their keys in nondecreasing (or nonincreasing) order. In the first merging step, all groups have a size of one processor, and each group is paired with another group called the partner group. In this step, there is only one communication partner per processor. Each pair exchanges its two boundary keys (the minimum and the maximum key) and determines the new order of the two processors according to the minimum key values. Each pair then exchanges group histograms and computes a new one that covers the whole pair. Each processor then divides the intervals (histogram bins) of the merged histogram into two parts (i.e. bisection) so that the lower-indexed processor keeps the smaller half of the keys and the higher-indexed processor the larger half. Each processor then sends out the keys that will belong to the other processor (for example, the keys in the shaded intervals are transmitted to the other processor in Figure 2) and merges its remaining keys with those arriving from the paired processor. Each processor now holds N/P ± ∆ keys because the bisection of the histogram bins may not be perfect (we hope ∆ is relatively small compared to N/P). The larger the number of histogram bins, the better the load balancing. In this process, only the keys in the overlapped intervals need to be merged: keys in the non-overlapped interval(s) do not interleave with the partner processor's keys during the merge, so they are simply placed in the proper position in the merged list. Often there may be no overlapped intervals at all, in which case no keys are exchanged.
From the second step on, the group size (i.e. the number of processors per group) doubles at each step. The merging process is the same as before except that each processor may have multiple communication partners, up to the group size in the worst case. Boundary values and group histograms are again exchanged between paired groups, then the order of processors is decided and the histogram bins are divided into 2^i parts at the ith iteration. Keys are exchanged between partners, then each processor merges the received keys. One cost-saving method used here is called index swapping. Since merging two groups into one may require many processors to move keys, only the ids of the corresponding processors are swapped, if needed, to obtain the correct sort sequence, so that keys are not propagated sequentially across multiple processors. Index swapping minimizes the number of keys exchanged among processors. The procedure of the parallel sort is summarized in Algorithm 2.
Algorithm 2:
2.3. Each processor sends the keys that will belong to other processors due to the division to the designated processors.
2.4. Each processor locally merges its keys with the received ones to obtain a new sorted list.
2.5. Broadcast logical ids of processors for the next iterations.
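The first merging step, in which each group is a single processor, can be sketched as follows. This is a minimal illustration rather than the paper's implementation. It assumes integer keys and equal-width histogram bins spanning the pair's combined key range, and it models the two processors as in-memory lists with no real communication. The names `histogram` and `balanced_pair_merge` and the default of 64 bins are hypothetical.

```python
import bisect
import heapq
import random


def histogram(keys, edges):
    """Count keys per bin; bin b covers [edges[b], edges[b + 1])."""
    counts = [0] * (len(edges) - 1)
    for k in keys:
        counts[bisect.bisect_right(edges, k) - 1] += 1
    return counts


def balanced_pair_merge(a, b, num_bins=64):
    """Merge sorted lists a and b held by two processors; return (lower, upper)."""
    lo, hi = min(a[0], b[0]), max(a[-1], b[-1])  # exchanged boundary keys
    span = hi + 1 - lo  # integer keys assumed, so hi + 1 is an exclusive bound
    edges = [lo + span * i / num_bins for i in range(num_bins + 1)]
    merged_hist = [x + y for x, y in zip(histogram(a, edges), histogram(b, edges))]
    # Bisection: choose the bin boundary whose running count is closest to half.
    half, running = (len(a) + len(b)) / 2, 0
    best_cut, best_gap = 0, float("inf")
    for i, count in enumerate(merged_hist):
        running += count
        if abs(running - half) < best_gap:
            best_cut, best_gap = i + 1, abs(running - half)
    split_key = edges[best_cut]
    # Keys below split_key go to the lower processor, the rest to the upper one.
    ia, ib = bisect.bisect_left(a, split_key), bisect.bisect_left(b, split_key)
    lower = list(heapq.merge(a[:ia], b[:ib]))
    upper = list(heapq.merge(a[ia:], b[ib:]))
    return lower, upper


random.seed(1)
a = sorted(random.randrange(1 << 32) for _ in range(2048))
b = sorted(random.randrange(1 << 32) for _ in range(2048))
lower, upper = balanced_pair_merge(a, b)
delta = abs(len(lower) - len(a))  # imbalance relative to N/P
```

Because the cut can only fall on a bin boundary, `delta` is bounded by about half the count of the bin that contains the median, which is why more bins give better balance. The sketch merges all keys and does not model the saving from non-overlapped intervals or index swapping.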
Rather involved operations are added to the algorithm in order to minimize key movement, since communication in distributed memory computers is costly. The scheme has to send boundary keys and histogram data at each step, and a broadcast of the logical processor ids is needed before each new merging iteration. If the lists are small (fine-grained), the increased parallelism may not shorten the execution time. Thus, our scheme is effective only when the number of keys is large enough to outweigh the overhead.
Experimental Results
The new parallel merge sort has been implemented on two different parallel machines: a Cray T3E and a Pentium III PC cluster. Note that the T3E is expected to achieve the highest performance enhancement because it has the largest C, introduced in Eq. (9) in the appendix. The speedups in merge time of the load-balanced merge sort over the conventional merge sort are shown in Figures 3 and 4. The speedups with the Gaussian distribution are smaller than those with the uniform distribution since ∆ in Eq. (7) is larger for the Gaussian distribution than for the uniform distribution. The improvement gets better as the number of processors increases. The measured speedups are close to the predicted ones when N/P is large. When N/P is small, the performance suffers from overhead such as exchanging boundary values and histogram information, and broadcasting processor ids. The higher speedup observed on the T3E supports the analytic result given in Eq. (8). Comparisons of the total sorting time and its distribution between the load-balanced merge sort and the conventional algorithm are shown in Figure 5. The local sort times of both methods are the same on a given machine. The best speedup of 8.2 in the merging phase is achieved on the T3E with 32 processors.
Conclusion
We have improved the parallel merge sort by storing and processing an approximately equal number of keys in all processors throughout the merging phases. Using the histogram information, keys can be divided nearly equally regardless of their distribution. We have achieved a maximal speedup of 8.2 in merging time for 4M keys on a 32-processor Cray T3E, which is about 78% of the upper bound. This scheme can be applied to parallel implementations of similar merging algorithms such as parallel quick sort.
A Appendix
The upper bound of the speedup of the new parallel merge sort is now estimated. Let T_seq(N/P) be the time for the initial local sort to make a sorted list.
where K_1 and K_2 are the average time to transmit one key and the average time per key to merge N keys, respectively, and S is the startup time. The parameters K_1, K_2 and S depend on the machine architecture.
For Algorithm 1, step 1 requires T_seq(N/P).
Step 2 repeats log P times, so the execution time of the simple parallel merge sort (SM) is estimated as below:
In Eq. (3) the communication time is assumed proportional to the size of the data, ignoring the startup time (coarse-grained communication in most interprocessor communication networks exhibits this characteristic). For Algorithm 2, step 1 requires T_seq(N/P). The time required in steps 2.1 and 2.2 is negligible if the number of histogram bins is small compared to N/P. Since the maximum number of keys assigned to each processor is N/P, at most N/P keys are exchanged among paired processors in step 2.3. Each processor merges N/P + ∆ keys in step 2.4. Step 2.5 requires O(log P) time. The communication of steps 2.1 and 2.2 can be ignored since its time is small relative to the communication time in step 2.3 if N/P is large (coarse-grained). Since step 2 is repeated log P times, the execution time of the load-balanced parallel merge sort (LBM) can be estimated as below:
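One way to make the two merge-phase estimates concrete is to sum the per-step costs numerically. The sketch below is an illustration under assumptions taken from the prose: transmitting m keys costs K_1·m (startup ignored), merging m keys costs K_2·m, and the histogram, boundary-key and broadcast steps are neglected. The function names and the parameter values in the usage lines are hypothetical, not measurements.

```python
import math


def merge_time_simple(n, p, k1, k2):
    """Merge-phase time of the simple scheme, summed over log2(p) steps."""
    total = 0.0
    for i in range(1, int(math.log2(p)) + 1):
        sent = 2 ** (i - 1) * n / p   # keys a sender transmits in step i
        merged = 2 ** i * n / p       # keys a receiver merges in step i
        total += k1 * sent + k2 * merged
    return total


def merge_time_balanced(n, p, k1, k2, delta=0.0):
    """Merge-phase time of the load-balanced scheme: log2(p) equal steps."""
    per_step = k1 * n / p + k2 * (n / p + delta)
    return math.log2(p) * per_step


def speedup(n, p, k1, k2, delta=0.0):
    """Ratio of the simple merge time to the load-balanced merge time."""
    return merge_time_simple(n, p, k1, k2) / merge_time_balanced(n, p, k1, k2, delta)


# Hypothetical parameters: 4M keys, 32 processors, equal per-key costs.
eta_ideal = speedup(1 << 22, 32, k1=1.0, k2=1.0)
eta_imbalanced = speedup(1 << 22, 32, k1=1.0, k2=1.0, delta=0.05 * (1 << 22) / 32)
```

In this model a nonzero `delta` only increases the load-balanced merge time, so `eta_imbalanced` is smaller than `eta_ideal`.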
To observe the enhancement in merging phase only, the first terms in Eqs. (3) and (4) will be removed. Using the relationship in Eqs. (1) and (2), merging times are rewritten as follows:
The speedup of the load-balanced merge sort over the conventional merge sort, denoted as η, is defined as the ratio of T_CM to T_LBM:
If the load-balanced merge sort keeps load imbalance small enough to ignore ∆, and N/P is large, Eq. (7) can be simplified as follows:
where C is a value determined by the ratio of the interprocessor communication speed to computation speed of the machine as defined below
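As a hedged sketch, the cost model described in this appendix can be written out as follows. The forms are assumptions inferred from the prose (linear communication cost with startup S, linear merge cost, startup ignored in the sums, step 2.5 neglected), and the names T_comm and T_merge are illustrative, so the expressions may differ in detail from the numbered equations.

```latex
\begin{align*}
T_{\mathrm{comm}}(m) &= S + K_1 m, \qquad T_{\mathrm{merge}}(m) = K_2 m \\
T_{\mathrm{CM}} &\approx \sum_{i=1}^{\log P}\left[K_1\,2^{i-1}\frac{N}{P} + K_2\,2^{i}\frac{N}{P}\right]
  = (K_1 + 2K_2)\,\frac{N}{P}\,(P-1) \\
T_{\mathrm{LBM}} &\approx \log P\left[K_1\frac{N}{P} + K_2\left(\frac{N}{P}+\Delta\right)\right] \\
\eta &= \frac{T_{\mathrm{CM}}}{T_{\mathrm{LBM}}}
  = \frac{(K_1 + 2K_2)\,\frac{N}{P}\,(P-1)}{\log P\left[K_1\frac{N}{P} + K_2\left(\frac{N}{P}+\Delta\right)\right]} \\
\eta &\xrightarrow{\;\Delta \to 0\;} C\,\frac{P-1}{\log P}, \qquad C = \frac{K_1 + 2K_2}{K_1 + K_2}
\end{align*}
```

Under these assumptions 1 < C < 2, so the bound would lie between (P − 1)/log P and 2(P − 1)/log P, which for P = 32 is roughly 6.2 to 12.4. The quoted bound of 10.5 falls within that range.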
