We present a hardware-algorithm for sorting N elements using either a p-sorter or a sorting network of xed I/O size p, while strictly enforcing con ict-free memory accesses. To the best of our knowledge, this is the rst realistic design that achieves optimal time performance, running in ? N log N p log p time for all ranges of N. Our result completely resolves the problem of designing an implementable, time-optimal algorithm for sorting N elements using a p-sorter. More importantly, however, our result shows that in order to achieve optimal time performance all that is needed is a sorting network of depth O(log 2 p) such as, for example, Batcher's classic bitonic sorting network.
Introduction
Recent advances in VLSI have made it possible to implement algorithm-structured chips as building blocks for high-performance computing systems. Since sorting is one of the most fundamental computing problems, it makes sense to endow general-purpose computer systems with a special-purpose parallel sorting device, invoked whenever its services are needed.
In this article, we address the problem of sorting N elements using a sorting device of I/O size p, where N is arbitrary and p is xed. The sorting device used is either a p-sorter or a sorting network of xed I/O size p. We assume that the input as well as the partial results reside in several constant-port memory modules. In addition to achieving time-optimality, it is crucial that we sort without memory access con icts.
In real-life applications, the number N of elements to be sorted is much larger than the xed size p that a sorting device can accommodate. In such a situation, the sorting device must be used repeatedly in order to sort the input. The following natural question arises: \How should one schedule memory accesses and the calls to the sorting device in order to achieve the best possible sorting performance?" Clearly, if this question does not nd an appropriate answer, the power of the sorting device will not be fully utilized.
A p-sorter is a sorting device capable of sorting p elements in constant time. Computing models for a psorter do exist. For example, it is known that p elements can be sorted in O(1) time on a p p recon gurable T , the time required for sorting becomes O(D(T ) f(N; p)). To eliminate this O(D(T )) slowdown factor, the network must be used in a pipelined fashion. In turn, pipelining requires that su cient parallelism in the p-sorter operations be identi ed and exploited. Recently, Olariu, Pinotti and Zheng 9] introduced a simple but restrictive design { the row merge model { and showed that in this model N elements can be sorted in N p log N time using either a p-sorter or with a sorting network of I/O size p. To achieve better sorting performance, a new algorithm-structured architecture must be designed. This involves devising a sorting algorithm suitable for hardware implementation and, at the same time, an architecture on which the algorithm can be executed directly. Such an algorithm-architecture combination is commonly referred to as a hardware-algorithm. The major contribution of this article is to present the rst realistic hardware-algorithm design for sorting an arbitrary number of input elements using a xedsize sorting device in optimal time, while strictly enforcing con ict-free memory accesses. We introduce a parallel sorting architecture specially designed for implementing a carefully designed algorithm. The components of this architecture include a parallel sorting device, a set of random-access memory modules, a set of conventional registers, and a control unit. This architecture is very simple and feasible for VLSI realization.
We show that in our architectural model N elements can be sorted in N log N p log p time using either a p-sorter or a sorting network of xed I/O size p and depth O(log 2 p). In conjunction with the theoretical work of 2], our result completely resolves the problem of designing an implementable, time optimal, algorithm for sorting N elements using a p-sorter. More importantly, however, our result shows that in order to achieve optimal sorting performance a p-sorter is not really necessary: all that is needed is a sorting network of depth O(log 2 p) such as, for example, Batcher's classic bitonic sorting network. As we see it, this is exceedingly important since any known implementation of a p-sorter requires powerful processing elements, whereas Batcher's bitonic sort network uses simple comparators.
2
In this section we describe the architectural framework within which we specify our optimal sorting algorithm using a xed-size sorting device. We consider that a sequential sorting algorithm is adequate for the case N < p 2 . Consequently, from now on, we assume that N p 2 :
(
This assumption implies that just for addressing purposes we need at least 2 logp bits. 1 For the reader's convenience, Figure 1 depicts our design for p = 9. To keep the gure simple, control signal lines are not shown. The basic architectural assumptions of our sorting model include: (i) A data memory organized into p independent, constant-port, memory modules M 1 ; M 2 ; : : :; M p . Each word is assumed to have a length of w bits, with w 2 log p. We assume that the N input elements are distributed evenly, but arbitrarily, among the p memory modules. The words having the same address in all memory modules are referred to as a memory row. Each memory module M i is randomly addressed by an address register AR i , associated with an adder. Register AR i can be loaded with a word read from memory module M i or by a row address broadcast from the CU (see below).
(ii) A set of data registers, R i , (1 i p), each capable of storing a (w + 1:5 logp)-bit word. We refer to the word stored in register R i as a composed word, since it consists of three elds:
an element eld of w bits for storing an element, a long auxiliary eld of logp bits, and a short auxiliary eld of 0:5 logp bits.
These elds are arranged such that the element eld is to the left of the long auxiliary eld, which is to the left of the short auxiliary eld. Each eld of register R i can be loaded independently from memory module M i , from the i-th output of the sorting device, or by a broadcast from the CU. The output of register R i is connected to the i-th input of the sorting device, to the CU, and to memory module M i . We assume that:
In constant time, the p elements in the data registers can be loaded into the address registers or can be stored into the p modules addressed by the address registers. The bits of any eld of register R i , (1 i p), can be set/reset to all 0's in constant time.
All the elds of data register R i , (1 i p), can be compared with a particular value, and each of the individual elds can be set to a special value depending on the outcome of the comparison. Moreover, this parallel compare-and-set operation takes constant time.
(iii) A sorting device of xed I/O size p, in the form of a p-sorter or of a sorting network of depth O(log 2 p). We assume that the sorting device provides data paths of width w + 1:5 logp bits from its input to its output. The sorting device can be used to sort composed words on any combination of their element or auxiliary elds. In case a sorting network is used as the sorting device, it is assumed that the sorting network can operate in pipelined fashion.
(iv) A control unit (CU, for short), consisting of a control processor capable of performing simple arithmetic and logic operations and of a control memory used to store the control program as well as the control data. The CU generates control signals for the sorting device, for the registers, and for memory accesses. The CU can broadcast an address or an element to all memory modules and/or to the data registers, and can read an element from any data register. We assume that these operations take constant time.
Described above are minimum hardware requirements for our architectural model. In case a sorting network is used as the sorting device, one can use a \half-pipelining" scheme: the input to the network is provided in groups of D rows. The next group is supplied only after the output of the previous group is obtained. D is the depth of the sorting network. For the sorting network to operate at full capacity, one may add an additional set of address (resp. data) registers. One set of address (resp. data) registers is used for read operations, while the order set is used for write operations; both operations are performed concurrently.
Let us now estimate the VLSI area that our design uses for hardware other than data memory, the sorting device and the CU under the word model, i.e. assuming that the word length w is a constant. We exclude the area taken by the CU: this is because in a high-performance computer system, one of the processors can be assigned the task of controlling the parallel sorting subsystem. Clearly, the extra area is only that used for the address and the data registers, and this amounts to O(p) { which does not exceed the VLSI area of any implementation of a p-sorter or of a sorting network of I/O size p.
We do not include the VLSI area for running the data memory address bus, which has a width of log N p bits, and the control signal lines to data memory and to the sorting device, since they are needed for any architecture involving a data memory and a sorting device. It should be pointed out that for any architecture that has p memory modules involving a total of N p 2 words, the control circuitry itself requires at least 4 (maxflog p; log N p )g) = (log N p ) VLSI area. Since the operations performed by the control processor are simple, we can assume that it takes constant area. The length of the control memory words is at least log N p , which is the length of data memory addresses. As will become apparent, our algorithms require O( N p ) rows of data memory, and consequently, the control memory words have length O(log N p ). The control program is very simple and takes constant memory. However, O( N p 3=2 ) control words are used for control information, which can be stored in data memory.
An Extended Columnsort Algorithm
In this section, we present an extension of the well known Columnsort algorithm 5]. This extended Columnsort algorithm will be implemented in our architectural model and will be invoked repeatedly when sorting a large number of elements. There are two known versions of Columnsort 5, 6] : one involves eight steps, the other seven. We provide an extension of the 8-step Columnsort, because the 7-step version does not map well to our architecture.
Columnsort was designed to sort, in column-major order, a matrix of r rows and s columns. The \classic" Columnsort contains 8 steps. The odd-numbered steps involve sorting each of the columns of the matrix independently. The even-numbered steps permute the elements of the matrix in various ways. The permutation of Step 2 picks up the elements in column-major order and lays them down in row-major order. . We provide such an extension. We show that one additional sorting step is necessary and su cient to complete the sorting in case r s(s?1). Our extension can be seen as trading one additional sorting step for a larger range of applicability of the algorithm. 
Input
Step 1
Step 2
Step 3
Step 4
Step 5
Step 6
Step 7
Step 8
Step 9 is not satis ed. The rst eight steps of this example correspond to the 8-step Columnsort algorithm which does not produce a sorted matrix. By adding one more step, Step 9 , in which the elements in each column are sorted, we obtain an extended Columnsort algorithm. We assume a matrix M of r rows and s columns, numbered from 0 to r ? 1 and from 0 to s ? 1, respectively. Our arguments rely, in part, on the following well-known gem of computer science mentioned by Knuth 4] .
Proposition 1 Let M be a matrix whose rows are sorted. After sorting the columns, the rows remain sorted.
The following result was proved in 5]. Proof. Consider, again, a generic element x that ended up in position M(i; j) at the end of Step 3. The permutation speci c to Step 4 guarantees that x will be moved, in
Step 4, to a position that corresponds, in the sorted matrix, to the element of rank si + j. In general, this is not the correct position of x. However, as we shall prove, x is \close" to its correct position in the following sense: if x is in column c at the end of
Step 4, then in the sorted matrix x must be in one of the pairs of columns (c ? 1; c) (4) implies that y and z must lie in adjacent columns of the sorted matrix. As we saw, at the end of Step 4, x lies in the position corresponding to the element of rank is + j in the sorted matrix. Now, equation (3) Proof. Consider an arbitrary column k, (0 k s?1), at the end of Step 3. The permutation speci ed in
Step 4 guarantees that the rst r s elements in column k will appear in positions k; k+s; k+2s; ; k+r?s in column 0; the next group of r s elements will appear in positions k; k+s; k+2s; ; k+r?s of column 1, and so on. Since the columns were sorted at the end of Step 3, it follows that all the rows k; k+s; k+2s; : : :; k+r?s of M are sorted at the end of Step 4. Since k was arbitrary, the conclusion follows. What we just proved is that no element in a column can belong to the column to its left. A symmetric argument shows that no element belongs to the column immediately to its right, completing the proof. 2 By Lemma 7, one more sorting step completes the task. Thus, we have obtained a 9-step Columnsort that trades an additional sorting step for a larger range of r versus s.
Theorem 1 The extended 9-step Columnsort algorithm correctly sorts an r s matrix such that r s(s?1).
The Basic Algorithm
In this section we show how to sort, in row-major order, m, 1 m p 1 2 , memory rows using our architectural model while enforcing con ict-free memory accesses. The resulting algorithm, referred to as the basic algorithm, will turn out to be the rst stepping stone in the design of our time-optimal sorting algorithm. The basic algorithm is an implementation of the extended Columnsort discussed in Section 3 with m = s and p = r.
Our presentation will focus on the e cient use of a generic sorting device of I/O size p. With this in mind, we shall keep track of the following two parameters that will become key ingredients in evaluating the running time of the algorithm: the number of calls to the sorting device, and the amount of time required by all the data movement tasks that do not involve sorting.
Assume that we have to sort, in row-major order, the elements in m = p 1 2 memory rows. The case 1 m < p 1 2 is perfectly similar. We assume, without loss of generality, that the input is placed, in some order, in memory rows a + 1 through a + p Step 1: Sort all the rows independently. Step 2: Permuting rows.
The permutation speci c to Step 2 of Columnsort prescribes picking up the elements in each memory row and laying them down column by column. For an illustration, consider the case p = 9, with the initial element distribution featured in the following matrix: 1 2 3 4 5 6 7 8 9 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 1 00 2 00 3 00 4 00 5 00 6 00 7 00 8 00 9 00 (6) At the end of Step 2, the permuted matrix reads:
1 4 7 1 0 4 0 7 0 1 00 4 00 7 00 2 5 8 2 0 5 0 8 0 2 00 5 00 8 00 3 6 9 3 0 6 0 9 0 3 00 6 00 9 00 A careful examination of the permuted matrix reveals that consecutive elements in the same memory row will end up in the same memory module (e.g. elements 1, 2, 3 will occur in memory module M 1 ). Therefore, in order to achieve the desired permutation without memory-access con icts, one has to devise a di erent way of picking up the elements in various memory rows. For this purpose, we nd it convenient to view each element x stored in a memory module as an ordered triple hx; row(x); module(x)i where row(x) and module(x) stand for the identity of the memory row and of the memory module, respectively, containing x. Further, we let row(x)jmodule(x) denote the binary number obtained by concatenating the binary representations of row(x) and module(x). The details are spelled out in the following procedure. Clearly, this procedure involves p 1 2 iterations. In each iteration, p words are read, one from each memory module, sorted, and then written back into memory, one word per module, with no read and write memory access con icts. It would seem as though each memory module requires an arithmetic unit to compute the address of the word to be accessed in each iteration. In fact, as we now point out, such arithmetic capabilities are not required. Speci cally, we can use p 1 2 memory rows to store \o sets" used for memory access operations. For the above example, the o sets are 0 1 2 0 1 2 0 1 2 1 2 0 1 2 0 1 2 0 2 0 1 2 0 1 2 0 1 (7) At the beginning of Step 2 all the address registers contain a + 1. In the rst iteration, the entries in the rst row of the o set matrix are added to the contents of the address registers, guaranteeing that the correct word in each memory module is being accessed. As an illustration, referring to (7), we note that the o sets in the rst row indicate that the words involved in the read operation will be found at address a + 1 + 0 in memory module M 1 , address a + 1 + 1 = a + 2 in memory module M 2 , address a + 1 + 2 = a + 3 in memory module M 3 , and so on.
The key observation for understanding what happens in all the iterations is that in any column of the o set matrix (7), once the entry in the rst row is available, the subsequent elements in the same column can be generated by modulo p 1 2 arithmetic. In our architecture, this computation can be performed by the adder associated with each address register. In turn, this observation implies that, in fact, the o set matrix need not be stored at all, as its entries can be generated on the y.
Yet another important point to note is that each ordered triple hx; row(x); module(x)i is a composed word with three elds and that the composed words are sorted using the combination of two elds, namely, row(x) and module(x). Clearly, module(x) has logp bits, but it seems that in order to represent row(x) we need log N p bits. Actually, we can replace row(x) with the address o set contained in the o set matrix discussed above. Since the entries in that matrix are integers no larger than p Step 3: Same as Step 1.
10
Step 4: The permutation of Step 2 is performed in reverse; the permuted set of words are stored in rows a + 1; a + 2; : : :; a + p 1 2 .
Step 5: Same as Step 1.
Step 6: Shifting rows.
We shall permute the elements slightly di erently from the way speci ed by Columnsort. However, it is easy to verify that the elements supposed to end up in a given row, indeed end up in the desired row. Since Step 7 sorts the rows, the order in which the elements are placed in the row in Step 6 is immaterial.
The permutation of Step 6 is best illustrated by considering a particular example. Speci cally, the permutation speci ed by Step 6 of Columnsort involving the three rows shown in (6) Assume that the p 1 2 consecutive input rows are stored in memory starting from memory row a + 1. In addition, we assume that memory row a is available to us. Some of its contents are immaterial and will be denoted by \?"s. The motivation is anchored in the observation that in Step 7 we do not have to sort memory rows a and a + p 1 2 : the elements in these rows will be sorted in Step 9. Consequently, the only rows that have to be sorted in Step 7 are rows a + 1 through a + p time is spent on data movement operations that do not involve sorting.
Step 7: Same as Step 1.
Step 8: This is simply the reverse of the data movement in Step 6.
Step 9: Same as Step 1.
To summarize, we have proved the following result. In the remainder of this section we present an important application of the basic algorithm. Suppose that we wish to merge two sorted sequences A = a 1 a 2 a n and B = b 1 b 2 b n . Our algorithm for merging A and B relies on the following technical result. It is worth noting that Lemmas 8 and 9, combined, show that given two sorted sequences, each of size n, the task of merging them can be handled as follows: we begin by splitting the two sequences into two sequences of size n each, such that no element in the rst one is strictly larger than any element in the second one. Once this \separation" is available, all that remains to be done is to sort the two sequences independently. The noteworthy feature of this approach is that it ts extremely well our architecture. Let It is obvious that procedure MERGE TWO GROUPS can be implemented directly in our architectural model. One point is worth discussing, however. Speci cally, the task of sorting a sequence in non-increasing order can be performed in our architecture as follows. The signs of all the elements to be sorted are ipped and the resulting sequence is then sorted in non-decreasing order. Finally, the signs are ipped back to their original value. The correctness of the procedure follows from Lemmas 8 and 9. Moreover, the procedure requires three calls to the basic algorithm.
Consider the task of sorting a collection of 2mp memory rows, with m as above. Having partitioned the input into two subgroups of m consecutive memory rows each, we use the basic algorithm to sort each group. Once this is done, we complete the sorting using procedure MERGE TWO GROUPS. Thus, we have the following result.
Theorem 4 The task of sorting 2mp, 1 m p 1 2 , elements stored in 2m memory rows can be performed in ve calls to the basic algorithm and O(m) time for data movement operations not involving sorting. 2 5 An E cient Multiway Merge Algorithm Step 3. Partition A into p i?2 2 buckets B 1 ; B 2 ; : : :; B p (i?2)=2 , each containing at most 2mp elements, as discussed below, and move the elements of A to their buckets without memory access con icts;
Step 4. Sort all the buckets individually using the basic algorithm and procedure MERGE TWO GROUPS;
Step 5. Coalesce the sorted buckets into the desired sorted sequence. 2 Notice that if i = 3, the sample S will be stored in one memory row.
14 The remainder of this section is devoted to a detailed implementation of this procedure on our architecture. 2 ) time for data movement and no calls to the sorting device. The sampling process continues, recursively, until a level is reached where procedure MULTIWAY MERGE is invoked with either i = 3, in which case the corresponding sample set is stored in one memory row and will be sorted in one call to the sorting device, or with i = 4, in which case the sample set is stored in m memory rows, and will be sorted in one call to the basic algorithm. Since the operation of sorting one row is direct, we only discuss the way the basic algorithm operates in this context.
Implementing Step 1 and
Conceptually, the process of sorting the samples bene ts from being viewed as one of sorting the concatenation sjk, where s is a sample element and k its sequence index. Recall that, as described in Section 2, our design assumes that the sorting device provides data paths of size w + 1:5 logp from its inputs to its outputs. This implies that Steps 1, 3, 5, 7, and 9 of the extended Columnsort can be executed directly. To sort a row r of S and the corresponding row r of I, the CU loads, in two parallel read operations, the element eld and the short auxiliary eld of data register R j , (1 j p), with S r; j] and I r; j], and the long auxiliary eld with 0.
Let s r;j and k r;j be the element and its sequence index stored in register R j and let s r;j jk r;j denote their concatenation. Next, the contents of the data registers are supplied as input to the sorting device. Let s r;j 0 jk r;j 0 be received by R j after sorting, with s r;j 0 and k r;j 0 stored, respectively in the element and short auxiliary eld of R j . In two parallel write operations, the CU stores the element eld and the short auxiliary eld of each register R j , (1 j p), into S r; j] and I r; j], respectively. Steps 2, 4, 6, and 8 of the basic algorithm perform permutations. The implementation of Steps 6 and 8 does not involve sorting. In this case, the data movement involving the sample elements and that of the 15 corresponding sequence indices will be performed in two companion phases. Speci cally, viewing the sample set S and its corresponding sequence index set I as two matrices, the same permutation is performed on S and I. Steps 2 and 4 of the basic algorithm involve both data movement operations and sorting. The data movement operations in these steps are similar to those in Steps 6 and 8 and will not be detailed any further. Recall that the sorting operations in Steps 2 and 4 of the basic algorithm are performed on the concatenation of the two auxiliary elds storing the relative row number and the column number of the element. Hence, we perform two companion sorting phases, one for permuting the sample elements and the other for permuting sequence indices. Clearly, this can be implemented with the same time complexity.
It is easy to con rm that at the end of Step 2 of procedure MULTIWAY MERGE the sample set S is sorted in row-major order. Furthermore, viewed as matrices, I x; y] is the sequence index of the sample element S x; y]. Let the sorted version of S be S = s 1 s 2 s mp (i?2)=2 :
Equation (8) will be used in Step 3 to partition the elements of A into buckets. In order to do so, the leader of each row in A needs to learn its rank in (8) .
Our next goal is to associate with every memory row in A the rank (x ? 1)p + y of its leader s = S x; y] in S. This task will be carried out in two stages. In the rst stage, using the sequence index and the rank of s in S the CU assigns to s a row number row(s) in A. 
Implementing Step 3 and Step 4
Once the rank of each leader in A is known, we are ready to partition A into buckets. Our rst objective is to construct a collection B 1 ; B 2 ; : : :; B p (i?2)=2 of buckets such that the following conditions are satis ed:
(b1) every element of A belongs to exactly one bucket;
(b2) no bucket contains more than 2mp elements; (b3) for every i and j, (1 i < j p i?2
2 ), no element in B i is strictly larger than any element in B j . Before presenting our bucket partitioning scheme, we need a few de nitions. Let S = s 1 s 2 s mp (i?2)=2 be as in (8) . The memory row with leader s b is said to be regular with respect to bucket B j , (1 j p i?2 2 ), if (j ? 1)m < b jm: (9) Notice that equation (9) guarantees that every memory row in A is regular with respect to exactly one bucket and that the identity of this bucket can be determined by the CU in constant time. Conversely, with respect to each bucket there are exactly m regular memory rows.
A memory row r with leader s b in some sequence A k , (1 k m), is termed special with respect to bucket B t if, with s a standing for the leader of the preceding memory row in A k , if any, we have a tm < b:
Let the memory rows with leaders s a and s b be regular with respect to buckets B j 0 and B j , respectively, such that j 0 < j. It is very important to note that equation (10) implies that the memory row whose leader is s b is special with respect to all the buckets B j 0; B j 0 +1 ; : : :; B j?1 .
Conceptually, our bucket partitioning scheme consists of two stages. In the rst stage, by associating all regular and special rows with respect to a generic bucket B j , (1 j p i?2 2 ), we obtain a set C j of candidate elements for bucket B j . In the second stage, we assign the elements of A to buckets in such a way that the actual elements assigned to bucket B j form a subset of the candidate set C j .
Speci cally, an element x of a memory row regular with respect to bucket B j is assigned to B j if one of the conditions below is satis ed: s (j?1)m < x s jm whenever s (j?1)m < s jm (11) or s (j?1)m x s jm whenever s (j?1)m = s jm : (12) The elements of A that have been assigned to a bucket by virtue of (11) or (12) are no longer eligible for being assigned to buckets in the remainder of the assignment process.
Consider, further, an element x that was not assigned to the bucket with respect to which its memory row is regular. Element x will be assigned to exactly one of the buckets with respect to which the memory row containing x is special. Assume that the memory row containing x is special with respect to buckets B j1 ; B j2 ; : : :; B j l(x) with j 1 < j 2 < : : : < j l(x) . Let j n be the smallest index, 1 n l(x), for which one of the equations (11) or (12) holds with j n in place of j. Now, x is assigned to bucket B jn . The next result shows that the buckets we just de ned satisfy the conditions (b1){(b3).
Lemma 10 Every bucket B j , (1 j p i?2 2 ), satis es the conditions (b1), (b2), and (b3).
Proof. Clearly, our assignment scheme guarantees that every element of A gets assigned to some bucket and that no element of A gets assigned to more than one bucket. Thus, condition (b1) is veri ed.
Further, notice that by (9) and (10), combined, for every j, (1 j p i?2
2 ), the candidate set C j with respect to bucket B j contains at most 2m memory rows, and, therefore, at most 2mp elements of A. Moreover, as indicated, the elements actually assigned to bucket B j are a subset of C j , proving that (b2) is satis ed.
Finally, equations (11) and (12) guarantee that if an element x belongs to some bucket b j then it cannot be strictly larger than any element in a bucket B k with j < k. Thus, condition (b3) holds as well. 2 It is worth noting that the preceding de nition of buckets works perfectly well even if all the input elements are identical. In fact, if all elements are distinct, one can de ne buckets in a simpler way. Moreover, in the case of distinct elements, Steps 1{3 of procedure MULTIWAY MERGE can be further simpli ed.
We now present the implementation details of the assignment of elements to buckets. Write s 0 = ?1 and denote, for every j, (1 j p i?2 2 ), the ordered pair (s (j?1)m ; s jm ) as the j-th bounding pair. Notice that equations (11) and (12) amount to testing whether a given element lies between a bounding pair.
By (b2), no bucket contains more than 2mp elements from A. This motivates us to set aside 2m memory rows for each bucket B j . Out of these, we allocate the rst m memory rows to elements assigned to B j coming from regular memory rows with respect to B j ; we allocate the last m memory rows to elements assigned to bucket B j that reside in special memory rows with respect to B j . In addition, we nd it convenient to initialize the contents of the 2m memory rows allocated to B j to all +1's.
It is important to note that the regular memory rows with respect to a bucket B j are naturally ordered from 1 to m by the order of the corresponding leaders in S. To clarify this last point, recall that by (9) the m leaders belonging to bucket B j are s (j?1)m+1 ; s (j?1)m+2 ; : : :; s jm : Accordingly, the memory row whose leader is s (j?1)m+1 is the rst regular row with respect to B j , the memory row whose leader is s (j?1)m+2 is the second regular row with respect to B j , and so on. Similarly, the fact that each sequence A k is sorted guarantees that it may contain at most one special memory row with respect to bucket B j . Now, in case such a special row exists it will be termed the k-th special memory row with respect to B j , to distinguish it from the others.
In order to move the elements to their buckets, the CU scans the memory rows in A one by one. Suppose that the current memory row being scanned is row r in some sequence A k . We assume that the leader of row r is s b and that the leader of row r ? 1 is s a . Using equation (9) , the CU establishes that row r is regular with respect to bucket B j , where j = d b m e and, similarly, that the previous memory row is regular with respect to bucket B j 0, where j 0 = d a m e j. In case row r is the rst row of A k , j 0 is set to 1.
Next, the elements in memory row r are read into the element elds of the data registers the CU broadcasts to these registers the bounding pair (s (j?1)m ; s jm ). Using compare-and-set, each register stores in the short auxiliary eld a 1 if the corresponding element is assigned to bucket B j by virtue of (11) or (12) and a 0 otherwise. We say that an element x in some data register is marked if the value in the short auxiliary eld is 1; otherwise, x is unmarked.
Clearly, every element x that is marked at the end of this rst broadcast has been assigned to bucket B j . In a parallel write operation, the CU copies all the marked elements to the corresponding words of the ((b ? 1) mod m + 1)-th memory row allocated to bucket B j . Once this is done, using compare-and-set, all the marked elements in the data registers are set to +1 and the short auxiliary elds are cleared.
Further, the CU broadcasts to the data registers, in increasing order, the bounding pairs of the buckets B j 0; B j 0 +1 ; : : :; B j?1 . Let us follow the processing speci c to bucket B j 0 . Having received the bounding pair (s (j 0 ?1)m ; s j 0 m ), each data register determines whether the value x stored in its element eld satis es (11) or (12) with j 0 in place of j and marks x accordingly. In a parallel write operation, the CU copies all the marked elements to the corresponding words of the next available memory row allocated to bucket B j . Next, using compare-and-set all the marked elements in the data registers are set to +1, and the short auxiliary elds are cleared. The same process is then repeated for all the remaining buckets with respect to which row r is special.
The reader will not fail to note that when the processing of row r is complete, each of its elements has been moved to the bucket to which it has been assigned. Moreover, by (9) and (10) 2 ) time for data movement and no calls to the sorting device. In Step 4, the buckets are sorted independently. If a bucket has no more than p 1 2 memory rows, it can be sorted in one call to the basic algorithm. Otherwise, the bucket is partitioned in two halves, each sorted in one call to the basic algorithm. Finally, the two sorted halves are merged using procedure MERGE TWO GROUPS. By Theorem 4, the task of sorting all the buckets individually can be performed in O(mp i?2 2 ) calls to the sorting device and in O(mp i?2 2 ) time for data movement not involving sorting.
Implementing Step 5
To motivate the need for the processing speci c to Step 5, we note that after sorting each bucket individually in Step 4, there may be a number of +1's in each bucket. We refer to such elements as empty; memory rows consisting entirely of empty elements will be termed empty rows. A memory row is termed impure if it is partly empty. It is clear that each bucket may have at most one impure row. A memory row that contains no empty elements is referred to as pure.
The task of coalescing the non-empty elements in the buckets into mp i?2 2 consecutive memory rows will be referred to as compaction. For easy discussion, we assume that all sorted buckets are stored in consecutive rows. That is, the non-empty rows of B 2 follow the non-empty rows of B 1 , the non-empty rows of B 3 follow the non-empty rows of B 2 , and so on, assuming that all empty rows have been removed. The compaction process consists of three phases.
Phase 1: Let C be the row sequence obtained by concatenating non-empty rows of B j 's obtained in Step 4 of MULTIWAY MERGE in the increasing order of their indices. We partition sequence C into subsequences C 1 ; C 2 ; : : :; C x such that each C j contains p Lemma 11 The preceding row of every impure row, except the last row, of E is a pure row.
Proof. We notice the following fact: except the last row of D, every row of D either contains at least p 1 2 non-empty elements, or if it contains fewer than p 1 2 non-empty elements then its preceding row must be a pure row. This is because that each row of C contains at least one non-empty element. An impure row of D can be generated under one of two conditions: (a) if C j contains fewer than p non-empty elements, then C 0 j contains only one row, an impure row, with its non-empty elements coming from p 1 2 impure rows of C j , and (b) if a C j contains more than p non-empty elements, then C 0 j contains only one impure row, and its preceding row is a pure row. The lemma directly follows from this fact. 2 Phase 2: This phase computes a set of parameters, which will be used in the next phase. Let w be the total number of (non-empty) rows in E. Assume that the rows of E are located from row 1 through row w. For every j, (1 j p i?2 2 ), we let n j stand for the number of non-empty elements in the impure memory row c j . The rst subtask of Phase 2 is to determine n 1 ; n 2 ; : : :; n p i?2 2 . Consider a generic impure row c j . To determine n j the CU reads the entire row c j into the data registers R 1 ; R 2 ; : : :; R p such that for every k, (1 k p), the c j -th word of memory module M k is read into register R k . The long auxiliary eld of data register R k is set to k. By using the compare-and-set feature, the CU instructs each register R k to reset this auxiliary eld to ?1 if the element it holds is +1 (i.e. empty). Next, the data registers are loaded into the sorting device and sorted in increasing order of their long auxiliary elds. It is easy to con rm that, after sorting, the largest such value k j is precisely the position of the rightmost non-empty element in memory row c j . Therefore, the CU sets n j = p + 1 ? k j . Consequently, the task of computing all the numbers n 1 ; n 2 ; : : :; n p 2 ) read/write operations and does not involve sorting. Once the numbers n 1 ; n 2 ; : : :; n w are available, the CU computes the pre x sums 1 = n 1 ; 2 = n 1 + n 2 ; : : :; j = n 1 + n 2 + : : :; +n j ; : : :; w = n 1 + n 2 + : : :; +n w . This, of course, involves only additions and can be performed by the CU in O(mp 
Complexity Analysis
With the correctness of our multiway merge algorithm being obvious, we now turn to the complexity. Speci cally, we are interested in assessing the total amount of data movement, not involving sorting, that is required by procedure MULTIWAY MERGE. Speci cally, let J(mp 
The Sorting Algorithm
With the basic algorithm and the multiway merge at our disposal, we are in a position to present the details of our sorting algorithm using a sorting device of xed I/O size p. The input is a set of N items stored, as
