Abstract: In this paper, we describe a family of parallel-sorting algorithms for a multiprocessor system. These algorithms are enumeration sortings and comprise the following phases: 1) count acquisition: the keys are subdivided into subsets and for each key we determine the number of smaller keys (count) in every subset; 2) rank determination: the rank of a key is the sum of the previously obtained counts; 3) data rearrangement: each key is placed in the position specified by its rank. The basic novelty of the algorithms is the use of parallel merging to implement count acquisition. By using Valiant's merging scheme, we show that $n$ keys can be sorted in parallel with $n\log_2 n$ processors in time $C\log_2 n + o(\log_2 n)$; in addition, if memory fetch conflicts are not allowed, using a modified version of Batcher's merging algorithm to implement phase 1), we show that $n$ keys can be sorted with $n^{1+\alpha}$ processors in time $(C'/\alpha)\log_2 n + o(\log_2 n)$, thereby matching the performance of Hirschberg's algorithm, which, however, is not free of fetch conflicts.
I. INTRODUCTION
THE EFFICIENT implementation of comparison problems, such as merging, sorting, and selection, by means of multiprocessor computing systems has attracted considerable attention in recent years. One of the earliest fundamental results is due to Batcher [1], who proposed a sorting network consisting of comparators and based on the principle of iterated merging; as is well known, such a scheme sorts $n$ keys with $O(n(\log n)^2)$ comparators in time $O((\log n)^2)$. Batcher's network is readily interpreted, in a more general framework, as a system of $n/2$ processors with access to a common data memory of $n$ cells: obviously, the network structure induces a nonadaptive schedule of memory accesses. After the appearance of Batcher's paper, substantial work was aimed at filling the gap between the upper bound $O((\log n)^2)$ on the number of steps achievable by a network of comparators and the lower bound $O(\log n)$; the lack of success, however, convinced several workers to look for more flexible forms of parallelism.
Manuscript received May 5, 1977; revised September 30, 1977. This work was supported in part by the National Science Foundation under Grant MCS76-17321 and in part by the Joint Services Electronics Program under Contract DAAB-07-72-C-0259.
The author is with the Coordinated Science Laboratory, Department of Electrical Engineering and the Department of Computer Science, University of Illinois, Urbana, IL 61801.
The first scheme shown to sort $n$ keys in time $O(\log n)$ is due to Muller and Preparata [2], but it requires a discouraging number of $O(n^2)$ processors. Subsequently, new results were obtained on parallel merging by Gavril [3]. Valiant [4] must be credited with addressing the fundamental question of the intrinsic parallelism of some comparison problems and with the development of faster algorithms than were previously known. In particular, in [4] he described an algorithm for merging two sorted sequences of $n$ and $m$ keys, respectively ($n \le m$), with $\sqrt{nm}$ processors in $2\log\log n + O(1)$ comparison steps;¹ this algorithm can then be applied to sort $n$ keys with $n$ processors in $2\log n\log\log n + O(\log n)$ steps. His method assumes a computational model in which there is no penalty for memory-processor alignment, and the overhead corresponding to the reassignment of sets of processors to subsequences to be merged is ignored.
¹ Throughout this paper "log" means "logarithm to the base 2."
0018-9340/78/0700-0669$00.75 © 1978 IEEE
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-27, NO. 7, JULY 1978
A new family of sorting algorithms has been recently discovered by Hirschberg [5] 
end
Note that count acquisition, rank computation, and data rearrangement are performed, respectively, in Steps 2, 3, and 4. Also, the algorithm must ensure that all ranks are distinct, which is a crucial condition for the data rearrangement task (otherwise, memory store conflicts would occur). This clearly poses no problem when the keys are all distinct. In the opposite case, some convention must be adopted for the ordering of sets of identical keys. One such convention is that sorting be stable (see [6, p. 4]); that is, the initial order of identical keys is preserved in the sorted array. Thus all of our sorting schemes will be stable. This is reflected in the rules for the computation of the count parameters in Step 2 of the algorithm just presented.
The simple algorithm proposed by Muller and Preparata in [2] is a crude example of enumeration sorting, in which the sets $A_i$ are chosen to be singletons. With this choice, each key is compared with every other key, thereby using $O(n^2)$ processors; similarly, rank computation uses $O(n^2)$ processors, since $O(n)$ processors are assigned to each key. The time bound $O(\log n)$ is due to Step 3 (counting in parallel the number of 1's in a set of $n$ binary digits), whereas Steps 2 and 4 run in constant time in our present model.
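The three phases of enumeration sorting with singleton subsets can be illustrated by a short sequential sketch. It is only meant to show the logic, not the parallel schedule: the function name `enumeration_sort` and the tie-breaking by original index are assumptions chosen here to make the ranks distinct and the sort stable, in the spirit of the convention discussed above.

```python
def enumeration_sort(keys):
    """Sequential sketch of enumeration sorting with singleton subsets.

    Returns the sorted list together with the rank of each input key.
    """
    n = len(keys)
    ranks = [0] * n
    for i in range(n):
        # Count acquisition and rank computation: every comparison yields
        # a bit (0 or 1), and the rank of key i is the sum of these bits.
        # Equal keys are ordered by original position (assumed convention),
        # which keeps all ranks distinct and makes the sort stable.
        ranks[i] = sum(
            1
            for j in range(n)
            if keys[j] < keys[i] or (keys[j] == keys[i] and j < i)
        )
    # Data rearrangement: each key is stored at the position given by its rank.
    out = [None] * n
    for i in range(n):
        out[ranks[i]] = keys[i]
    return out, ranks
```

In a parallel setting the inner comparisons are independent of one another, which is why the only non-constant cost in this scheme is the summation of the comparison bits.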
In the more complex procedures to be described later, the operations of rank computation and data rearrangement are essentially carried out as in the basic scheme just described. The main difference occurs with regard to count acquisition. In the Muller- [4] . Notice that this model of parallel computation coincides with that required by Valiant's merging algorithm.
We assume inductively that the following algorithm, Sort1, for $p < n$ uses at most $\lfloor p\log p\rfloor$ processors to sort $p$ keys. Since Sort1 is recursive, the following presentation constitutes a constructive extension of the inductive step to the integer $n$. The induction can be started with $n > 4$.
R[i;Oi; --k (i=, ,k; I=O, ,r-1 1
Comment: Steps 6 and 7 complete the count acquisition task. In fact, after Step 7 the content of $R[i; j; l]$ is the corresponding count, in the terminology of Section I.
Step 6 can be executed in two time units using $\binom{k+1}{2}r$ processors, whereas Step 7 uses $(k+1)r$ processors and runs in one time unit. To complete the analysis of the algorithm, we observe that none of Steps 4-7 uses more than $\binom{k+1}{2}r$ processors. But
$\binom{k+1}{2}r = \frac{k(k+1)}{2}\lfloor n/k\rfloor \le \frac{n(k+1)}{2}$
where the last inequality is due to the removal of the "floor" sign. Also, Step 8 uses $n\lfloor (k+1)/2\rfloor \le n(\lceil\log n\rceil + 1)/2$ processors. Since, for all $n > 4$, $n(\lceil\log n\rceil + 1)/2 < \lfloor n\log n\rfloor$, the inductive hypothesis on the number of processors is extended.
Finally, let $T(n)$ denote the running time of the algorithm for $n$ keys. Since $r \approx n/\log n$ we obtain
$T(n) = T(n/\log n) + C_2\log\log n + C_3$
for some constants $C_2$ and $C_3$. It is easily verified that a function of the form $C_2\log n + o(\log n)$ is a solution of the above recurrence. It is worth noting that for the same number of processors, Valiant proposes a sorting scheme of the merge-sort type [4, corollary 8] which runs in time $2\log n\log\log n - o(\log n\log\log n)$.
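The verification can be sketched as follows, reading the recurrence as $T(n) = T(n/\log n) + C_2\log\log n + C_3$. The remainder function $f$ is introduced here only for the purpose of the check and does not appear in the original analysis.

```latex
\begin{align*}
\text{Ansatz:}\quad T(n) &= C_2\log n + f(n). \\
T\!\left(\frac{n}{\log n}\right) + C_2\log\log n + C_3
  &= C_2\,(\log n - \log\log n) + f\!\left(\frac{n}{\log n}\right)
     + C_2\log\log n + C_3 \\
  &= C_2\log n + f\!\left(\frac{n}{\log n}\right) + C_3 .
\end{align*}
% Hence f(n) = f(n / log n) + C_3: the remainder grows by C_3 per level.
```

Each level of the recursion lowers $\log n$ by $\log\log n$, so the number of levels is $O(\log n/\log\log n)$ and the remainder satisfies $f(n) = o(\log n)$, consistent with the stated form of the solution.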
III. PARALLEL-SORTING ALGORITHMS WITH NO MEMORY FETCH CONFLICTS
We shall now consider a family of algorithms for sorting $n$ numbers in parallel with $n^{1+\alpha}$ processors ($0 < \alpha < 1$) in time $(C'/\alpha)\log n + o(\log n)$, for some constant $C'$. Each of these algorithms has the same performance as the corresponding algorithm by Hirschberg [5], although no memory fetch conflict occurs in this case. Again, we make the inductive hypothesis that for $p < n$, Algorithm Sort2 uses $p^{1+\alpha}$ processors to sort $p$ keys. The format of Sort2 closely parallels that of Sort1, with a few crucial differences to be noted.
used. Since $n - kr < k$, then $N < kr^{1+\alpha} + (n - kr)k^{\alpha} = kr(r^{\alpha} - k^{\alpha}) + nk^{\alpha}$. Also $kr = \lceil n^{\alpha}\rceil\lfloor n/\lceil n^{\alpha}\rceil\rfloor \le n$, whence $N < n(r^{\alpha} - k^{\alpha} + k^{\alpha}) = nr^{\alpha} \le n\cdot n^{\alpha(1-\alpha)} < n^{1+\alpha}$ processors are used; 3) merging is completed in $\log r \approx (1-\alpha)\log n$ time units.
Steps 8-11 of this algorithm are, respectively, identical to Steps 6-9 of Sort1 and are therefore omitted. The latter are clearly free of memory fetch conflicts. The analysis of Sort1 showed that at most $\max\left(\binom{k+1}{2}r,\ n\lfloor (k+1)/2\rfloor\right)$ processors were used in any of those steps. In the present case, we have already shown that $\binom{k+1}{2}r < n^{1+\alpha}$; similarly, we conclude $n\lfloor (k+1)/2\rfloor \le n(n^{\alpha} + 1)/2 < n^{1+\alpha}$.
From the performance viewpoint, all steps of the algorithm require at most $n^{1+\alpha}$ processors, as postulated. This extends the inductive hypothesis on the number of processors used by the algorithm. As to the running time $T(n)$, we note the following: Steps 4-6 jointly require $\alpha\log n + 1$ time units; Step 7 requires $(1-\alpha)\log n$ time units; Step 10 requires $\alpha\log n$ time units; Steps 8, 9, and 11 run in constant time. Since Step 3 is a recursive call of Sort2 on sets of $r \approx n^{1-\alpha}$ elements, we obtain for $T(n)$ the recurrence equation
$T(n) = T(n^{1-\alpha}) + (C'_1\alpha + C'_2)\log n + C'_3$
for some constants $C'_1$, $C'_2$, and $C'_3$. It is easily verified that a function of the form $[(C'_1\alpha + C'_2)/\alpha]\log n + o(\log n)$ is a solution of this equation, whence $T(n) \le (C'/\alpha)\log n + o(\log n)$.
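The closed form can be checked numerically by unrolling the recurrence, read here as $T(n) = T(n^{1-\alpha}) + c\log n + c_3$ with $c = C'_1\alpha + C'_2$. The sketch below is an illustration under that reading: the function name `unroll_recurrence`, the stopping rule (subproblems of at most two keys cost nothing), and the sample constants are assumptions, not part of the original analysis.

```python
def unroll_recurrence(log_n, alpha, c, c3):
    """Unroll T(n) = T(n**(1 - alpha)) + c*log2(n) + c3.

    The problem size is tracked through log_n = log2(n), since
    log2(n**(1 - alpha)) = (1 - alpha) * log2(n).
    """
    total = 0.0
    # Assumed base case: a subproblem with log2(n) <= 1 costs nothing.
    while log_n > 1.0:
        total += c * log_n + c3
        log_n *= 1.0 - alpha
    return total


if __name__ == "__main__":
    log_n, alpha, c = 1024.0, 0.5, 1.5
    # The log terms form a geometric series with ratio (1 - alpha),
    # so their sum stays below (c / alpha) * log2(n).
    print(unroll_recurrence(log_n, alpha, c, 0.0), (c / alpha) * log_n)
```

With these sample values the unrolled cost is 3069 against a bound of 3072, which shows how the factor $1/\alpha$ arises from summing the geometric series $\sum_i (1-\alpha)^i$.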
APPENDIX A STABLE VERSION OF BATCHER'S MERGING ALGORITHM
The original version of Batcher's odd-even merging algorithm runs as follows (here, for simplicity, we assume that the common length of the sequences to be merged is a power of 2):
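As an illustration, the usual textbook formulation of odd-even merging can be sketched in Python as below. This is a sequential sketch under the stated power-of-2 assumption; the function name `odd_even_merge` is hypothetical, and the code operates on bare keys, so it does not address the stability issue that this appendix is concerned with.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of the same power-of-2 length."""
    if len(a) == 1:
        # Base case: a single comparison orders the two keys.
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the even-indexed terms and the odd-indexed terms
    # of the two sequences; the two recursive merges are independent.
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    # Interleave the two results; one final layer of independent
    # compare-exchange operations puts every key in place.
    out = [evens[0]]
    for i in range(len(odds) - 1):
        x, y = odds[i], evens[i + 1]
        out.append(min(x, y))
        out.append(max(x, y))
    out.append(odds[-1])
    return out
```

All comparisons within one level are independent of each other, which is what allows the merge of two sequences of length $2^j$ to complete in $j + 1$ parallel comparison steps.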
