Sorting networks of fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network for sorting arbitrarily large data sets has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p to sort an arbitrarily large data set of N elements. A noteworthy feature of our design is that no extra data memory space is required, other than what is used for storing the input. As it turns out, our architecture is feasible for VLSI implementation and its time performance is virtually independent of the cost and depth of the underlying sorting network. Specifically, we show that by using our design N elements can be sorted in $\Theta(\frac{N}{p} \log \frac{N}{p})$ time without memory access conflicts. Finally, we show how to use an $AT^2$-optimal sorting network of fixed I/O size p to construct a similar architecture that sorts N elements in $\Theta(\frac{N}{p} \log \frac{N}{p \log p})$ time.
example, Batcher's classic bitonic sorting network and odd-even merge sorting network [4, 5]
Leighton [13] and Paterson [21] developed comparator-based sorting networks of I/O size p, cost O(p log p), and depth O(log p) that sort p elements in O(log p) time. The AKS network is both cost-optimal and depth-optimal (i.e., time-optimal) in the context of sorting p elements with each comparator used once.
It is interesting to note that, in spite of the fact that sorting networks of fixed I/O size p have been extensively investigated in the context of sorting p elements, their efficient use for sorting a large number, say N, of elements has not received much attention in the literature. In real-life applications, the number N of input elements to be sorted is much larger than p. In such a situation, the sorting network must be used repeatedly, in a pipelined fashion, in order to sort the input elements efficiently. Assume that the input as well as the partial results reside in several constant-port memory modules. Then, scheduling memory accesses and the I/O of the sorting network becomes the key to achieving the best possible sorting performance. Clearly, if an appropriate answer to this problem is not found, the power of a sorting network will not be fully utilized.
The problem of building sorting networks out of smaller components such as p-sorters and mergers has received attention in the literature [7, 8, 15, 20]. Bilardi
A related problem, namely that of sorting N elements by repeatedly using a p-sorter, has received attention of late [6, 18, 20]. A p-sorter is a sorting device capable of sorting p elements in constant time. Computing models for a p-sorter do exist. For example, it is known that p elements can be sorted in O(1) time on a reconfigurable mesh of size $p \times p$ [11, 14, 16, 17]. A reconfigurable mesh is a multiprocessor system in which the processors are connected by a bus system whose configuration can be dynamically changed to suit computational needs. Beigel and Gill [6] showed that the task of sorting N elements, $N \ge p$, requires $\Omega(\frac{N \log N}{p \log p})$ calls to a p-sorter. They also presented an algorithm that sorts N elements using $\Theta(\frac{N \log N}{p \log p})$ calls to the p-sorter. Their algorithm, however, assumes that the p elements to be sorted by the p-sorter can be fetched in unit time, regardless of their locations in memory. Since, in general, the address patterns of the operands of p-sorter operations are irregular, it appears that the algorithm of [6] cannot realistically achieve the time complexity of $O(\frac{N \log N}{p \log p})$ unless the addressing problem can be solved in constant time on realistic machines. To address this problem, Olariu and Zheng [18] proposed a p-sorter-based architecture that sorts N elements in $O(\frac{N \log N}{p \log p})$ time while strictly enforcing conflict-free memory accesses. In conjunction with the results of [6], their result completely resolves the time complexity issue of sorting N elements using a p-sorter. As it turns out, a p-sorter is a much more expensive device than a sorting network, and its use should be avoided whenever possible.
Besides, it is not clear whether it is possible to replace the p-sorter with a pipelined sorting network in the architecture of [18], while guaranteeing the same performance.
The main contribution of this work is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size to sort an arbitrarily large number N of elements. Specifically, we show that by using our design, N elements can be sorted in $\Theta(\frac{N}{p} \log \frac{N}{p})$ time. Our design consists of a sorting network of fixed I/O size p, a set of $\frac{p}{2}$ random-access memory modules, and a control unit. The memory access patterns are regular: in one step, elements from two rows of memory modules are supplied as input to the sorting network and/or the output of the sorting network is written back into two memory rows. Our architecture is feasible for VLSI implementation. We then show how to use an $AT^2$-optimal sorting network of fixed I/O size p to construct a similar architecture that sorts an arbitrary number N of elements in $\Theta(\frac{N}{p} \log \frac{N}{p \log p})$ time. An important feature of both architectures is that no extra data memory space is required, other than what is needed for storing the input.
The remainder of the paper is organized as follows. In Section 2 we discuss the details of the proposed architecture. In Section 3 we show how to obtain row-merge schedules, a critical ingredient for the efficiency of our design. Section 4 extends the results of 
The Architecture
A sorting network can be modeled as a directed graph whose nodes represent processing elements and whose edges represent the links connecting the nodes, as illustrated in Figure 1. The processing elements can be simple comparators or more complex processors capable of performing arithmetic operations. A comparator has two inputs and two outputs and is used to perform a compare-exchange operation. A comparator-based sorting network is a sorting network whose processing elements are comparators. In the remainder of this work, we use the term sorting network to refer exclusively to a comparator-based sorting network. Figure 1 and Figure 2 illustrate Batcher's classic sorting networks, with I/O size 8. As illustrated in Figure 1, two types of comparators are used. For a type 0 comparator, the smaller and larger of the two input numbers emerge, after comparison, at the top and bottom output, respectively. A comparator of type 1 produces the output in reverse order. Unless stated otherwise, we assume that when a sorting network of fixed I/O size p is used to sort p elements, each of its comparators is used exactly once.
A simple inductive argument shows that for every k, $2 \le k \le D(S)$, every comparator in layer $L_k$ receives at least one input from a comparator in layer $L_{k-1}$. Therefore, in a layered sorting network the longest paths from the network input to the comparators in layer $L_k$ must have the same length.
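To make the comparator conventions concrete, the following Python sketch models type 0 and type 1 comparators and assembles one common formulation of Batcher's bitonic network for p a power of two. The function names and the 0-based line numbering are illustrative assumptions, and the wiring may be drawn differently from the one in Figure 1.

```python
def comparator(x, y, kind=0):
    """Compare-exchange: type 0 puts the smaller value on top, type 1 the larger."""
    lo, hi = (x, y) if x <= y else (y, x)
    return (lo, hi) if kind == 0 else (hi, lo)


def bitonic_layers(p):
    """Layers of (top_line, bottom_line, kind) for a bitonic sorter; p is a power of two."""
    layers = []
    k = 2
    while k <= p:
        j = k // 2
        while j >= 1:
            layer = []
            for i in range(p):
                partner = i ^ j
                if partner > i:
                    # Blocks with bit k clear sort ascending (type 0), others descending (type 1).
                    kind = 0 if (i & k) == 0 else 1
                    layer.append((i, partner, kind))
            layers.append(layer)
            j //= 2
        k *= 2
    return layers


def run_network(values, layers):
    """Pass the values through the network, using every comparator exactly once."""
    v = list(values)
    for layer in layers:
        for top, bottom, kind in layer:
            v[top], v[bottom] = comparator(v[top], v[bottom], kind)
    return v


layers = bitonic_layers(8)
print(len(layers), [len(layer) for layer in layers])
print(run_network([5, 3, 8, 1, 7, 2, 6, 4], layers))
```

In this formulation every layer touches all p lines, so each layer holds p/2 comparators.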
We say that a sorting network S is pipelined if for every k, $2 \le k \le D(S)$, all paths from the network input to the comparators in layer $L_k$ have the same length¹. As an illustration, the bitonic sorting network shown in Figure 1 is a pipelined network, whereas the odd-even merge sorting network of Figure 2 is not. The intuition for this terminology is that a pipelined sorting network S of I/O size p can be used to sort sets of p elements concurrently in a pipelined fashion. It is easy to confirm that in a pipelined network S each layer contains exactly p
For each output of a comparator c in layer k such that k < D(S) that is also the output of the network, we add D(S) − k latch nodes on the output edge. The reader should have no difficulty confirming that in the resulting network all paths from the network input to the nodes in the same layer have the same length. Thus, this transformation converts a non-pipelined network into a pipelined one. For example, after adding latches to the odd-even merge sorting network of Figure 2, we obtain the network shown in Figure 3.
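The pipelined property can be checked mechanically. The sketch below is a hedged illustration rather than the construction used in the paper: it assigns each comparator to the earliest possible layer, reports whether the two inputs of every comparator arrive from the immediately preceding layer, and counts the D(S) − k latches placed on network outputs as described above. The function name `pipeline_profile` and the comparator lists are assumptions; the 8-input list follows the standard odd-even merge construction and may be drawn differently in Figure 2.

```python
def pipeline_profile(comparators, num_lines):
    """Return (is_pipelined, depth, output_latches) for a comparator list.

    comparators: (top_line, bottom_line) pairs listed in input-to-output order.
    """
    last = [0] * num_lines   # layer of the last comparator on each line (0 = network input)
    pipelined = True
    for top, bottom in comparators:
        if last[top] != last[bottom]:
            pipelined = False            # the two inputs arrive along paths of different lengths
        layer = 1 + max(last[top], last[bottom])
        last[top] = last[bottom] = layer
    depth = max(last)
    # An output leaving a comparator in layer k < depth receives depth - k latches.
    output_latches = [depth - k for k in last]
    return pipelined, depth, output_latches


odd_even_merge_8 = [
    (0, 1), (2, 3), (4, 5), (6, 7),
    (0, 2), (1, 3), (4, 6), (5, 7),
    (1, 2), (5, 6),
    (0, 4), (1, 5), (2, 6), (3, 7),
    (2, 4), (3, 5),
    (1, 2), (3, 4), (5, 6),
]
bitonic_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 1), (2, 3)]

print(pipeline_profile(odd_even_merge_8, 8))
print(pipeline_profile(bitonic_4, 4))
```

Only the output latches are counted here; latches on internal edges are not modeled.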
Our proposed architecture, which we call the Row-Merge Architecture (RMA, for short), is illustrated in Figure 4 for p = 8. The RMA involves the following components:
¹ The length of a path is taken to be the number of edges on the path.
rows. Dummy elements of value $+\infty$ are added, if necessary, to ensure that all memory modules contain the same number of elements; these dummy elements will be removed after sorting. The read/write operations are carried out in a single instruction (address) stream, multiple data stream fashion controlled by the CU. Specifically, the CU is responsible for generating memory access addresses: in every step, the same address is broadcast to all memory modules, which use it as the local address for the current read or write operation. We assume that the address broadcast operation takes constant time. The CU can disable memory read/write operations, as necessary, by appropriately using a mask.
When operating in pipelined fashion, in a generic step i, p elements from two memory rows are fed into the sorting network; at the end of step i + D(T) + 1 the sorted sequence of p elements from these two rows emerges at the output ports of T and is written back into two memory rows.
This process is continued until all the input elements have been sorted. To simplify our analysis, we assume that one memory cycle is sufficient for reading, writing, and for comparator operations. This assumption is reasonable if each memory module has, say, two ports for reading and two ports for writing. If each memory module has only one port for both reading and writing, the performance degrades by a small constant factor.
Let (a, b) be an ordered pair of memory rows in data memory. In the process of sorting, the elements in memory row a (resp. b) are read into the left (resp. right) half of the network input, and the corresponding elements are sorted in non-decreasing order. Finally, the left (resp. right) half of the resulting sorted sequence emerging at the network output is written back into data memory to replace the original row a (resp. b). It is now clear why we refer to our design as the Row-Merge Architecture.
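The row-merge operation on an ordered pair (a, b) can be sketched in a few lines of Python. This is a functional model only, under stated assumptions: the data memory is a list of rows of p/2 elements each, rows are numbered from 0, Python's `sorted` stands in for one pass through the network T, and pipelining and timing are ignored. The names `row_merge` and `run_schedule`, the sample data, and the sample sequence of row pairs are hypothetical.

```python
def row_merge(memory, a, b):
    """Sort the elements of rows a and b together; the smaller half replaces row a,
    the larger half replaces row b."""
    half = len(memory[a])
    merged = sorted(memory[a] + memory[b])   # stands in for one pass through the network T
    memory[a] = merged[:half]
    memory[b] = merged[half:]


def run_schedule(memory, schedule):
    """Apply a sequence of ordered row pairs, one row-merge per pair."""
    for a, b in schedule:
        row_merge(memory, a, b)
    return memory


memory = [[9, 4], [7, 1], [8, 2], [6, 3]]                # 4 rows of p/2 = 2 elements
schedule = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]      # pairs taken from a 4-input sorting network
run_schedule(memory, schedule)
print(memory)
```

The sample pairs are the comparators of a standard 4-input sorting network, with each comparator replaced by a row-merge.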
In order to efficiently sort $\frac{2N}{p}$ memory rows on the RMA, we wish to identify a finite sequence $MS = \langle (a_1, b_1), (a_2, b_2), \ldots, (a_s, b_s) \rangle$ of pairs of memory rows such that by following this sequence the elements are sorted in row-major order. We call such a sequence MS a row-merge schedule or, simply, a merge schedule. At this time, the reader may wonder about the power of the RMA. In Theorem 1, we provide a partial answer to this question by establishing a lower bound on the time required by any algorithm that sorts N elements using the RMA.
to right order, the pairs of memory rows that will be supplied as input to the sorting network, in a pipelined fashion. For example, the ordered pair $(a_1, b_1)$ is supplied in the first time unit, followed by the ordered pair $(a_2, b_2)$ in the second time unit, and so on. For reasons that will be discussed later, we are interested in merge schedules that satisfy the following three conditions:
(1) The RMA must sort correctly if a p-sorter is used instead of the sorting network, that is, for sorting networks of depth one.
Assume that the N elements to sort are located in memory rows 1 through m. We generate a merge schedule MS from the line representation and the layer partition of S by the following greedy algorithm. Initially, the inputs to all comparators are unmarked. Let $C_1$ be an arbitrary FIFO (first-in first-out) queue of the comparators in layer $L_1$. We obtain a FIFO queue $C_{i+1}$ of comparators in layer $L_{i+1}$ as follows: Set $C_{i+1}$ to empty. Scan the comparator queue $C_i$ in order and, for each comparator in $C_i$, mark its two output edges. As soon as the two input edges of a comparator c are marked, include c in $C_{i+1}$. At this point the reader will not fail to note that comparator c must, indeed, belong to layer $L_{i+1}$. This process is continued, as described, until all the queues $C_j$, $1 \le j \le D(S)$, have been constructed. Finally, concatenate the queues $C_j$ to obtain a sequence C of comparators such that $C_i$ precedes $C_{i+1}$.
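The greedy ordering just described can be sketched as follows. This is a minimal illustration under several assumptions: lines are numbered from 0, a single FIFO queue replaces the separate queues $C_1, C_2, \ldots$ (the resulting order is their concatenation), edges coming directly from the network input are treated as already marked, and each comparator on lines (i, j) is reported as the pair (i, j). The function name and the sample network are hypothetical.

```python
from collections import deque


def greedy_comparator_sequence(layers, num_lines):
    """Order the comparators of a layered network by the greedy marking rule.

    layers: list of layers, each a list of (top_line, bottom_line) pairs.
    """
    # Unique ids, since the same pair of lines may appear in several layers.
    comps = [(k, top, bottom)
             for k, layer in enumerate(layers)
             for top, bottom in layer]

    # Line representation: link each comparator to its successor on each line.
    last_on_line = [None] * num_lines
    successor = {}
    marked = {c: 0 for c in comps}
    for c in comps:
        for line in (c[1], c[2]):
            prev = last_on_line[line]
            if prev is None:
                marked[c] += 1           # assumption: a network-input edge counts as marked
            else:
                successor[(prev, line)] = c
            last_on_line[line] = c

    queue = deque(c for c in comps if c[0] == 0)   # C_1, in the given (arbitrary) order
    sequence = []
    while queue:
        c = queue.popleft()
        sequence.append(c)
        for line in (c[1], c[2]):
            nxt = successor.get((c, line))
            if nxt is not None:
                marked[nxt] += 1         # mark one input edge of the successor
                if marked[nxt] == 2:     # both inputs marked: it joins the next queue
                    queue.append(nxt)
    return [(top, bottom) for _, top, bottom in sequence]


network = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
print(greedy_comparator_sequence(network, 4))
```

A comparator is appended the moment its second input edge is marked, so comparators enabled while scanning $C_i$ are appended in the order in which they become ready.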
Let $C = (c_{k_1}, c_{k_2}, \ldots, c_{k_s})$, s = C(S), be the sequence of comparators obtained from S using the
(1) and (2) above and, therefore, can be used to correctly sort N elements on the RMA. However, there exist sorting networks that cannot be used to generate a merge schedule that satisfies condition (2) for $D(T) \ge 3$. This fact restricts the applicability of the MS generating scheme. For example, if $D(T) \ge 3$, no MS generated directly from the network featured in Figure 7 can satisfy condition (2). To remedy this problem, we introduce the concept of augmenting sorting networks.
Figure 2 is given in Figure 8.
We will still use our greedy algorithm to generate an MS from an augmented network $S''$.
The comparator selection process is exactly as described above. However, the task of translating a comparator sequence into the corresponding MS is slightly modified to accommodate dummy
Operating in this fashion, an MS generated from an augmented network $S''$ clearly satisfies condition (1) because of the compare-exchange/merge-split principle (Proposition 1). As we shall prove in Theorem 2, the MS also satisfies condition (2). The length of the MS equals the cost of the sorting network from which the MS is generated. Both S and $S''$ have the same depth, but $S''$ has a higher cost than S. We note that, at first, it would seem that by using $S''$ to derive an MS, condition (3) will not be satisfied. However, all sorting networks S of I/O size m known to the authors, including the network featured in Figure 7, have O(mD(S)) cost; therefore, the cost of $S''$ is within a constant factor of the cost of S.
To summarize our findings we now state and prove the following important result.
Conceptually, we treat the network $S''$ as a data-driven (i.e., data-flow) architecture: the processing elements are precisely the comparators, whose activation is driven by data availability. We say that a comparator c is ready for activation whenever its two inputs are available and it has not yet been used. To prove the theorem, we need to show that for any j such that all comparators preceding $c_{k_j}$ have been used but $c_{k_j}$ has not yet been used, all the comparators in the subsequence
$\ldots, c_{k_{t-1}}$ in C. With this, we just proved that this MS can be used as a merge schedule for network T. Note also that by our previous discussion the time required to sort N elements is O(s), where s is the cost of $S''$, and it is bounded above by O(mD(S)). This completes the proof of the theorem.
, any row-merge schedule generated from the augmented network of any network S by our greedy algorithm can be used to sort N elements correctly; in other words, the correctness of any merge schedule is independent of the sorting network S used to generate it.
(d) The time required for sorting N elements is proportional to D(S), the depth of S.
We can select T from a wide range of sorting networks, depending on their VLSI feasibility.
We also have a wide range of sorting networks to choose from for deriving merge schedules. We
It is interesting to note that, even if the merge schedule is available, the RMA needs $O(\frac{N}{p} D(S))$ time to complete the task of sorting N elements. Thus, the time it takes the CU to compute an MS and the time needed by the network T to perform the sorting are perfectly balanced. In other words, the time complexity claimed in Theorem 3 also holds if the computation required for generating an MS is taken into account. It is very important to note that, once available, the MS can be used to sort many problem instances.
The working space required by the control memory is proportional to mD(S) words, each of O(log N) bits. Rather remarkably, the RMA does not require extra data memory space other than what is used for storing the input.
4 The Generalized Row-Merge Architecture
In a number of contexts, especially when the VLSI complexity of T is a concern, it is desirable to use an $AT^2$-optimal network as a parallel sorting device. The main goal of this section is to show that it is possible to design a sorting architecture that uses an $AT^2$-optimal sorting network as its parallel sorting device. As it turns out, the time performance of the new design, which we call the Generalized Row-Merge Architecture (GRMA, for short), is slightly better than that of the RMA discussed in Section 2.
The GRMA uses a sorting network T of fixed I/O size p with inputs $I_1, I_2, \ldots, I_p$ and outputs $O_1, O_2, \ldots, O_p$. It has p constant-port data memory modules $M_1, M_2, \ldots, M_p$, collectively referred to as the data memory. For every k, $1 \le k \le p$, memory module $M_k$ is connected to input $I_k$ and to output $O_k$. In one parallel read operation, one memory row is read and supplied as input to T; in one parallel write operation the output of T is written back into one memory row. Just as in the RMA, the memory accesses and the operation of the sorting network T are controlled by the control unit (CU). There are, however, three major differences between the GRMA and the RMA.
(c) For simplicity, we assume that the GRMA operates in a different pipelining mode than the RMA. Specifically, a group of r memory rows is fed into the network in r consecutive time steps and, after sorting, the r rows are written back to memory in r consecutive time steps. After that, another group of r memory rows is fed into the network, and so on. This process is repeated until the elements in all groups of r memory rows are sorted. The value r is proportional to the depth D(T) of T. (We note here that by changing the control mechanism, the GRMA can also operate in fully pipelined mode, i.e., the network T can be fed continuously.)
We select for T Leighton's optimal sorting network [13], which is known to be $AT^2$-optimal. This network, which is a hardware implementation of the well-known Columnsort algorithm, has I/O size $\frac{q}{\log q}$ and depth c log q, where c is a constant greater than 1. Two designs were proposed in [13]: one with a value of c significantly smaller than that of the other. Leighton's sorting network sorts an array of size $\log q \times \frac{q}{\log q}$ in row-major order in a pipelined fashion. Specifically, in each of the first log q steps, $\frac{q}{\log q}$ elements are fed into the network, and after c log q steps, these elements emerge, sorted, at the output of the network in log q consecutive steps.
Let q be such that $p = \frac{q}{\log q}$. We partition the $\frac{N}{p}$ memory rows into m =
By the compare-exchange/merge-split principle, the GRMA sorts N elements correctly. Since each iteration takes O(log q) time, the task of sorting N elements on the GRMA using this MS
Notice that in the case of the GRMA the computation of a merge schedule does not require an augmented network. The length of the merge schedule is somewhat shorter because it uses a network S of smaller I/O size and depth, and no dummy comparators. Again, no extra data memory is required other than what is used for storing the input.
Conclusions and Open Problems
The main motivation for this work was provided by the observation that, up to now, sorting networks of fixed I/O size p have only been used to sort a set of p elements. Real-life applications, however, require sorting arbitrarily large data sets. Rather surprisingly, the important problem of using a fixed I/O size sorting network in such a context has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p for sorting an arbitrarily large data set of N elements. A noteworthy feature of our design is that it does not require extra data memory space, other than what is used to store the input. As it turns out, the time performance of our design, which we call the Row-Merge Architecture (RMA), is virtually independent of the cost and depth of the underlying sorting network. Specifically, we showed that by using the RMA, N elements can be sorted in $\Theta(\frac{N}{p} \log \frac{N}{p})$ time, without memory access conflicts. In addition, we showed how to use an $AT^2$-optimal sorting network of fixed I/O size p to construct a similar architecture, termed the Generalized Row-Merge Architecture (GRMA), that sorts N elements in $\Theta(\frac{N}{p} \log \frac{N}{p \log p})$ time. At this point, we do not know whether or not a better performance can be achieved by removing the restriction on the rigid memory access scheme of the RMA, by allowing more flexible, yet regular, memory access patterns. In such a case, the time lower bounds for both the RMA and the GRMA no longer hold. As the results of [6] indicate, N elements cannot be sorted in less than $\Omega(\frac{N \log N}{p \log p})$ time using any parallel sorting device of I/O size p. It is an interesting open question to close the gap between this lower bound and the time performance offered by our designs.
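To get a feel for the size of this gap, the following Python sketch evaluates the three expressions for sample values of N and p. It is only an illustration: the constant factors hidden in the asymptotic notation are ignored, logarithms are taken to base 2, and the function names and sample values are assumptions.

```python
import math


def lower_bound(n, p):
    """N log N / (p log p): the bound from [6] for any sorting device of I/O size p."""
    return n * math.log2(n) / (p * math.log2(p))


def rma_time(n, p):
    """(N/p) log(N/p): the growth rate of the RMA sorting time."""
    return (n / p) * math.log2(n / p)


def grma_time(n, p):
    """(N/p) log(N/(p log p)): the growth rate of the GRMA sorting time."""
    return (n / p) * math.log2(n / (p * math.log2(p)))


n, p = 2 ** 20, 16
print(lower_bound(n, p), rma_time(n, p), grma_time(n, p))
print(rma_time(n, p) / lower_bound(n, p))
```

For these sample values the RMA expression exceeds the lower-bound expression by a factor of $\frac{\log p \cdot \log (N/p)}{\log N} = \frac{4 \cdot 16}{20} = 3.2$.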
The best performance of the designs proposed in this paper is proportional to the depth of the AKS network, which is used to construct merge schedules. The constant associated with the depth complexity of the AKS network is too large to be considered practical. However, our results reveal the potential of row-merge based simple sorting architectures.
Along this line of thought, a long-standing open problem is to obtain a realistic sorting network of logarithmic depth. It is equally important to discover a network of depth c log m log log m, where m is the network I/O size and c is a small constant. Such networks are useful for deriving practically short merge schedules.
