Abstract
Introduction
The task of sorting an arbitrarily ordered vector set according to magnitude (either from-largest-to-smallest or from-smallest-to-largest) is one of the fundamental operations required in many digital signal processing applications. The solutions of many fundamental computer science problems, such as searching, finding the closest pair, and computing frequency distributions, also depend on the efficient implementation of this function.
Sorting is an expensive operation in terms of area-time complexity: software-based solutions require word-level sorting and can become computationally intensive, while the overall complexity of hardware-based solutions usually increases very rapidly with the size of the input vector set (number of vectors) and with the bit-length of the input vectors [1], [2]. The design of efficient sorting engine architectures is therefore a significant challenge for overcoming the computational bottleneck of the binary sorting problem. A number of recent proposals for the realization of sorting networks rely primarily on median or rank order filters (ROF), yet their capabilities in terms of window size and bit-length are typically limited due to rapidly increasing hardware complexity [2], [3], [4].
In this paper, we present a new bit-serial sorting architecture based on rank-ordering. The hardware realization of this architecture results in a compact and fully modular sorting engine that is capable of processing a large number of input vectors in linear time. The overall architecture is completely scalable to accommodate a wide range of window sizes and bit-lengths, and the hardware complexity grows only linearly with both of these parameters. The proposed sorter architecture is essentially based on a fully programmable modular ROF design that was presented earlier [5], [6]. To our knowledge, this represents the first demonstration of a linear-complexity sorter architecture on silicon. The sorting algorithm and the corresponding architecture are presented in Section 2 and Section 3. The full-custom CMOS realization of the majority function and the overall sorting engine architecture are discussed in Sections 4 and 5, followed by a summary of results.
The Sorting Algorithm
A bit-serial algorithm proposed in [6] was chosen as the basis of the programmable sorter implemented in this work. In this algorithm, the problem of finding a rank-order selection for n-bit long words is reduced to finding n rank-order selections for 1-bit numbers.
The algorithm starts by processing the most significant bits (MSB) of the m = (2N + 1) words in the current window, through an m-input programmable majority function, to yield the MSB of the desired output. This output is then compared with the MSBs of the window elements. The vectors whose MSB is not equal to the output have their MSB propagated down by one position, replacing the less significant bits of the corresponding words. Thus, the bit-serial sorting algorithm operates on an input set (window) of m n-bit words. The output corresponds to a sequence of the input vectors in a desired rank order. Note that each bit-plane can be independently assigned its own rank value, which is used to calculate the slice output. A detailed explanation of the basic sorting algorithm can be found in [6] and [7].
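The bit-serial selection described above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the authors' implementation: the function name `rank_select`, the convention that rank k = 1 selects the largest word, and the modelling of the programmable majority function as a threshold test on the number of ones are all choices made here for illustration.

```python
def rank_select(words, n_bits, k):
    """Return the k-th largest word (assumed convention: k = 1 is the maximum).

    Software model of bit-serial rank-order selection: one bit-plane is
    processed per iteration, from the MSB down to the LSB.
    """
    # Unpack every word into a list of bits, MSB first.
    planes = [[(w >> (n_bits - 1 - j)) & 1 for j in range(n_bits)]
              for w in words]
    result = 0
    for j in range(n_bits):
        # Programmable majority, modelled as a threshold on the count of ones.
        ones = sum(bits[j] for bits in planes)
        out = 1 if ones >= k else 0
        result = (result << 1) | out
        # A word whose current bit differs from the output is already known
        # to be larger or smaller than the result, so its current bit
        # replaces all of its less significant bits.
        for bits in planes:
            if bits[j] != out:
                for i in range(j + 1, n_bits):
                    bits[i] = bits[j]
    return result


window = [5, 12, 3, 9, 7]
median = rank_select(window, n_bits=4, k=3)
```

In this model, a word forced to all ones always counts toward the threshold and a word forced to all zeros never does, which is why each later bit-plane can be decided with the same threshold k.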
0-7803-7761-3/03/$17.00 ©2003 IEEE
The Sorting Engine Architecture
The bit-serial operation flow of the algorithm described above suggests a simple bit-level pipelined data path architecture, consisting of data modifier-propagator blocks (ROF cells) to handle fine-grained data selection, and majority decision blocks (majority function) to determine output bits. The modular two-dimensional array architecture consisting of these two major blocks enables fully scalable construction of structures of arbitrary window size and bit-length. The bit-length dictates the number of majority decision gates (rows), whereas the window size determines the number of ROF cells driving each of these majority gates (columns). The structure of one row is shown in Fig. 1 and the circuit diagram of a ROF cell is shown in Fig. 2. Note that the circuit complexity of the ROF cell is fairly limited, resulting in a very compact realization. Here, m = (2N + 1) is the window size and n is the bit-length of the input words (vectors). Thus, it can be seen that the overall circuit complexity increases linearly with maximum window size (m) and with bit-length (n).
The proposed sorter architecture exploits the fact that the modular core described here is capable of generating one output vector per clock cycle, corresponding to the currently selected rank. If the ranking process is repeated on the same set of vectors instead of processing a continuous stream of new vectors, the members of the vector set can be sorted in linear time by simply changing (increasing or decreasing) the rank in each clock cycle. The circuit structure and the signal flow of one sorter bit-slice that is designed to implement the bit-level operations described above result in a regular, expandable structure, as seen in Fig. 1. The multiplexer on the input side is used for accepting the input vectors at the rate of one vector per clock cycle, as well as for circulating (rotating) the data until sorting is completed. The so-called "sorter core" is simply constructed by stacking n such bit-slices.
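The idea of sorting by repeating the rank selection on a fixed vector set can be sketched in software as follows. This is a minimal illustration, not a model of the circuit: `core_select` is a hypothetical stand-in for the ROF core (it simply uses Python's built-in `sorted`), rank 1 is assumed to denote the largest vector, and each loop iteration stands for one clock cycle.

```python
def core_select(window, rank):
    """Hypothetical stand-in for the ROF core: return the vector of the
    given rank, where rank 1 is assumed to be the largest vector."""
    return sorted(window, reverse=True)[rank - 1]


def sort_by_rank_sweep(vectors, descending=True):
    """Sort by holding the vector set fixed and changing the rank each cycle."""
    window = list(vectors)  # the set is circulated unchanged during the run
    m = len(window)
    # Sweep the rank upward or downward, one rank per modelled clock cycle.
    ranks = range(1, m + 1) if descending else range(m, 0, -1)
    return [core_select(window, r) for r in ranks]


largest_first = sort_by_rank_sweep([5, 12, 3, 9, 7])
smallest_first = sort_by_rank_sweep([5, 12, 3, 9, 7], descending=False)
```

The point of the sketch is only that the sort order is determined entirely by the direction of the rank sweep, while the stored vector set is never rearranged.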
The overall architecture of the sorting engine is shown in Fig. 3. The flow of data through the modular ROF core is regulated by complementary input and output shift registers, which stagger the individual bit-planes of each input vector to enable bit-level pipelined operation. The control logic is responsible for regulating the data circulation path, and for applying the rank selection signals to the individual bit-planes, in ascending or descending order. The fact that each individual bit-plane is capable of processing a different rank at any given time significantly increases the overall efficiency of this architecture. In a typical sorting run, the control logic simply requests each bit-plane to process a different rank in each clock cycle, either beginning from the maximum rank and descending, or beginning from the minimum rank and ascending.
The proposed architecture has been described in VHDL to verify its operation. Fig. 4 shows simulated results of the sorting operation on an arbitrarily ordered set of 15 vectors (m = 15), each with a bit-length of 8 bits (n = 8). The user determines how many input vectors are to be sorted ("actualWindowSize", not shown in Fig. 4) and in which direction the sorting will occur ("sortType"), and provides these inputs to the sorter block together with a request pulse ("sortRequest"). As soon as the request arrives, the sorter block asserts a signal ("sortActive") which stays at the logic-high level as long as the corresponding set of vectors is being processed. It can be seen that the first output vector is generated with a latency of (n-1) clock cycles after the last vector of the set is entered. The sorter block provides a signal to the user ("outputsValid") which goes high at the last rising edge of the clock before the first vector is ready at the output ("sortDataOutput").
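The staggered, bit-level pipelined operation can be pictured with a simple schedule. The sketch below assumes (this indexing is not taken from the paper) that bit-plane 0 is the MSB plane, that ranks are visited in sweep order 0 to m-1, and that each bit-plane lags the one above it by exactly one clock cycle. It models only the sorting phase, not the loading of the input vectors.

```python
def pipeline_schedule(m, n):
    """Return, for each clock cycle, a dict {bit_plane: rank_index}.

    Assumed model: in cycle t, bit-plane j (0 = MSB) works on the rank
    with sweep index t - j, provided that index lies in 0..m-1.
    """
    schedule = []
    for t in range(m + n - 1):
        active = {j: t - j for j in range(n) if 0 <= t - j < m}
        schedule.append(active)
    return schedule


# Small example: 4 vectors of 3 bits each.
for cycle, active in enumerate(pipeline_schedule(m=4, n=3)):
    print(cycle, active)
```

Under these assumptions the last bit of the last rank is produced in cycle m + n - 2, so the whole sweep occupies m + n - 1 cycles, and in the middle of the run every bit-plane is busy with a different rank.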

Realization of the Majority Function
The programmable majority (voting) function is the key operation that must be performed in each row. This function also determines the overall operation speed (i.e., the clock frequency), since a 63-input majority function must be evaluated in each row during each clock cycle. Note that the other operations described in the previous sections only involve data transfers from one ROF cell to the next; thus, they do not represent a critical bottleneck in terms of the time budget.
The 63-input programmable majority block has been realized with a fully combinational parallel counter that consists of 51 full adders connected in a tree-network, and an output comparator network that consists of 21 basic logic gates. Overall, the worst-case logic depth of the entire majority block is equivalent to 8 full adders in cascade. Consequently, the input-to-output delay of the programmable majority function is smaller than 4.5 ns. 
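The principle of a majority block built from a parallel counter followed by a comparison can be sketched as follows. This is an illustrative model only: the reduction order, the padding of incomplete groups with a constant 0, and the integer comparison standing in for the output comparator network are assumptions made here, and the sketch does not attempt to reproduce the adder count or logic depth of the actual circuit.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def parallel_count(bits):
    """Count the ones in `bits` using only full adders."""
    columns = [list(bits)]  # columns[w] holds bits of weight 2**w
    w = 0
    while w < len(columns):
        col = columns[w]
        while len(col) > 1:
            a = col.pop()
            b = col.pop()
            c = col.pop() if col else 0  # pad with 0 when only two bits remain
            s, carry = full_adder(a, b, c)
            col.append(s)                # sum stays in the same weight column
            if w + 1 == len(columns):
                columns.append([])
            columns[w + 1].append(carry)  # carry moves to the next weight
        w += 1
    return sum((col[0] if col else 0) << w for w, col in enumerate(columns))


def programmable_majority(bits, threshold):
    """Return 1 if at least `threshold` inputs are 1, else 0."""
    return 1 if parallel_count(bits) >= threshold else 0


# 63 alternating inputs: 32 ones and 31 zeros.
inputs = [1, 0] * 31 + [1]
vote = programmable_majority(inputs, threshold=32)
```

Changing `threshold` is what makes the function programmable in this model: the same counter output serves every rank, and only the comparison value differs.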
Figure 6. Simulation results of the 63-bit programmable majority function block, where 32 inputs are assigned logic-1 and 31 inputs are assigned logic-0 (alternating). The worst-case propagation delay is smaller than 4.5 ns.
Realization of the Sorter Architecture
The binary sorting engine architecture designed to process 63 input vectors of 16 bits (m = 63, n = 16) has been realized using conventional 0.35 µm CMOS technology. The architecture consists of 16 rows, where each row is capable of processing 63 bits simultaneously.
To shorten signal propagation paths and to simplify balanced clock distribution, the rows were designed with a folded geometry, and the 16 rows were divided into two main columns of 8 rows each. The top-level layout of the chip is shown in Figure 7. The input and output shift registers were distributed to each row to improve area utilization. A four-level balanced clock buffer network was used to distribute the system clock uniformly throughout the chip, with minimum skew. One 63-bit cell row and one 63-bit majority network are also highlighted in the layout to indicate their relative areas.
Conclusion
In this paper, we presented a highly modular architecture for the realization of high-speed binary sorting engines. The architecture consists of (i) a regular "core" array that is completely scalable to accommodate large window sizes and bit-lengths, (ii) input/output shift registers, and (iii) control logic to regulate the bit-level processing of data. It was shown that the complexity of the proposed bit-serial pipelined architecture increases linearly with the number of input vectors (m) to be sorted, and with the bit-length of the input vectors (n). It was also demonstrated that the proposed sorting engine is capable of producing a fully sorted output vector set in (m + n - 1) clock cycles, i.e., in linear time.
A full-custom sorting engine chip was realized to 
