A new algorithm and a new modular architecture are presented for the realization of high-speed binary sorting engines, based on efficient rank ordering. Capacitive Threshold Logic (CTL) gates are utilized for the implementation of the multi-input programmable majority (voting) functions required in the architecture. The overall complexity of the proposed bit-serial architecture increases linearly with the number of input vectors to be sorted (window size = m) and with the bit-length of the input vectors (word size = n), and the sorter architecture can be easily expanded to accommodate large vector sets. Detailed simulations indicate that the sorter structure can operate at sampling clock rates of up to 50 MHz, where the throughput is boosted by fine-grain pipelining. It is demonstrated that the proposed sorting engine is capable of producing a fully sorted output vector set in (m+n-1) clock cycles.
INTRODUCTION
The task of sorting an arbitrarily ordered vector set according to magnitude (either from largest to smallest or from smallest to largest) is one of the fundamental operations required in many digital signal processing applications. It is also an expensive operation in terms of area-time complexity: software-based solutions require word-level sorting and can become computationally intensive, while the overall complexity of hardware-based solutions usually increases very rapidly with the size of the input vector set (number of vectors) and with the bit-length of the input vectors [5], [1], [3]. The design of efficient sorting engine architectures is therefore a significant challenge for overcoming the computational bottleneck of the binary sorting problem. A number of recent proposals for the realization of sorting networks rely primarily on median or rank order filters (ROF), yet their capabilities in terms of window size and bit-length are typically limited due to rapidly increasing hardware complexity [5], [6], [8].
In this work, we present a new bit-serial sorting algorithm based on rank-ordering. The hardware realization of this algorithm results in a compact and fully modular sorting engine architecture that is capable of processing a large number of input vectors in linear time (Fig. 1). The overall architecture is completely scalable to accommodate a wide range of window sizes and bit-lengths, and the hardware complexity grows only linearly with both of these parameters. The proposed sorter architecture is essentially based on a fully programmable modular ROF design that was presented earlier [9]. In the following, we first present the rank ordering architecture in Section 2, followed by the proposed sorting algorithm and its hardware realization.
THE RANK ORDERING ARCHITECTURE
A bit-serial algorithm first proposed in [5] was chosen as the basis of the programmable rank-order filter architecture implemented in this work. In this algorithm, the problem of finding a rank-order-selection for n-bit long words is reduced to finding "n" rank-order-selections for 1-bit numbers.
The algorithm starts by processing the most significant bits (MSB) of the m = (2N + 1) words in the current window, through an m-input programmable majority gate, to yield the MSB of the desired filter output. This output is then compared with the MSBs of the individual window elements. The vectors whose MSB is not equal to the filter output have their MSB propagated down by one position, replacing the less significant bits of the corresponding words. The same procedure is then repeated on each successively less significant bit-plane, until the least significant bit of the filter output has been determined.
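The bit-serial selection described above can be sketched in software as follows. This is only a behavioral illustration, not the hardware data path: the function name rank_order_select, the convention that rank 1 denotes the smallest word, and the resulting majority threshold (m - rank + 1) are assumptions introduced for this example.

```python
def rank_order_select(words, n_bits, rank):
    """Return the word of the given rank (1 = smallest, m = largest).

    Behavioral sketch of bit-serial rank selection: one thresholded
    (programmable majority) decision per bit-plane, MSB first.
    """
    m = len(words)
    # bits[i][k] is bit k of word i, with k = 0 denoting the MSB.
    bits = [[(w >> (n_bits - 1 - k)) & 1 for k in range(n_bits)] for w in words]
    # Assumed convention: the output bit is 1 when at least (m - rank + 1)
    # of the m bits in the plane are 1.
    threshold = m - rank + 1
    out = 0
    for k in range(n_bits):
        ones = sum(row[k] for row in bits)
        out_bit = 1 if ones >= threshold else 0
        out = (out << 1) | out_bit
        # Words that disagree with the output bit are already decided as
        # larger or smaller, so their current bit overwrites all lower bits.
        for row in bits:
            if row[k] != out_bit:
                for j in range(k + 1, n_bits):
                    row[j] = row[k]
    return out
```

For instance, with the three 3-bit words 5, 3 and 7 and rank 2 (the median), the three plane decisions are 1, 0 and 1, so the selected word is 5.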
The bit-serial operation flow of the algorithm described above suggests a simple bit-level pipelined data path architecture, consisting of data modifier-propagator blocks to handle fine-grained data selection, and majority decision blocks to determine output bits.
A programmable rank-order filter of any window size and bit-length can be realized by using the two main blocks described above. The bit-length dictates the number of the majority decision gates, whereas the window size determines the number of ROF-cells driving one of these majority gates. This regular structure results in the ROF core array shown in Fig. 2. The programmable majority decision gates are realized using the capacitive threshold logic (CTL) circuit architecture presented earlier [4]. This allows simple implementation of programmable majority gates with up to 63 parallel inputs, using a very small silicon area (625 µm x 130 µm for a 63-bit majority gate).
The signal flow between the ROF cells and the majority gates is also shown in Fig. 2. The modular architecture, consisting of only two major blocks, enables fully scalable construction of filter structures of arbitrary size. It also forms the basis of the sorting algorithm described in the next section.
THE SORTING ALGORITHM
The proposed sorting algorithm is a bit-serial algorithm, whose input is a window of "m" n-bit words. The output corresponds to a sequence of the input vectors in a desired rank order. It starts by processing the MSBs of the "m" input vectors in the current window. Each bit-plane has its own rank value, which is used to calculate the corresponding slice output. The pseudo-code of the proposed sorting algorithm is given below. The algorithm involves two loops: the outer loop initializes the rank value for the next iteration and checks whether the sorting operation is finished, whereas the inner loop does the actual sorting operation by performing parallel instructions on "n" bit-planes.
Listing of the proposed sorting algorithm.
The very first step of the algorithm is to set the rank value of the most-significant bit-plane to the first desired rank (firstDesiredRank), whose value depends on whether the input vectors are to be sorted in ascending or descending order. For example, when sorting the input vectors in ascending order, at the first iteration of the main operation loop (inner loop) the rank value corresponding to the most-significant bit-plane (rankof(1)) has to be set to "smallestRank", which results in the smallest input word being produced at the filter output. The rank values for the subsequent iterations (nextDesiredRank) are also determined by the sorting direction. If sorting is in ascending order, rankof(0) is assigned "rankof(1) + 1" until the rank value corresponding to the MSB-plane equals the upper rank value (lastDesiredRank), which is the "largestRank". It should also be noted that the algorithm can be used for sorting the input vectors in any desired order. In this case, a look-up table may be used to provide the necessary sequence of rank values.
The operations contained in the inner loop are performed at the same time on all bit-planes. After the "m" bits in each bit-plane are arranged either by shifting or by shifting and rotating, the corresponding bit-plane output is calculated by evaluating all of the bits in each bit-plane according to the current rank value ("rank-order"). The algorithm is finished after the bit-plane corresponding to the least-significant bits is processed with the last rank value (lastDesiredRank).
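The rank stepping performed by the outer loop can be illustrated with a short behavioral sketch. It does not reproduce the pseudo-code listing and it ignores the bit-level pipelining; it simply applies one rank after another to the same window. The names select_rank, rank_sequence and sort_by_rank_sweep, the rank convention (1 = smallest) and the table argument standing in for the look-up table are assumptions made for this example.

```python
def select_rank(words, n_bits, rank):
    """Bit-serial rank selection, MSB first (rank 1 = smallest)."""
    m = len(words)
    forced = [None] * m  # bit value propagated down from a higher plane, if any
    out = 0
    for k in range(n_bits - 1, -1, -1):
        # Effective bit seen by the majority gate in this plane.
        eff = [f if f is not None else (w >> k) & 1
               for w, f in zip(words, forced)]
        out_bit = int(sum(eff) >= m - rank + 1)
        out = (out << 1) | out_bit
        # A word that disagrees for the first time keeps its bit from now on.
        forced = [b if (f is None and b != out_bit) else f
                  for b, f in zip(eff, forced)]
    return out


def rank_sequence(m, ascending=True, table=None):
    """Ranks applied to the MSB plane, one per outer-loop iteration."""
    if table is not None:
        return list(table)               # arbitrary order from a look-up table
    if ascending:
        return list(range(1, m + 1))     # smallest rank first, largest rank last
    return list(range(m, 0, -1))         # largest rank first, smallest rank last


def sort_by_rank_sweep(words, n_bits, ascending=True, table=None):
    """Sort by selecting one rank after another from the same window."""
    ranks = rank_sequence(len(words), ascending, table)
    return [select_rank(words, n_bits, r) for r in ranks]
```

As a usage example, sort_by_rank_sweep([9, 3, 14, 7, 3], 4) yields the window in ascending order, ascending=False reverses the sequence, and table=[3, 1, 2] applied to a three-word window returns the largest, smallest and median word in that order.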
The operation of the proposed sorting algorithm is illustrated with an example in Fig. 3. Here, five 4-bit vectors (A through E) are being sorted by the ROF core. The first rank (R1) is initially applied to the MSB plane consisting of the bits A1 through E1. In the next clock cycle, the same rank is used to process the next less significant bit-plane (A2 through E2), while a new rank (R2) is applied to the MSB plane. Also note that the staggered data bits are gradually circulated from the end of the chain to the front, so that each vector in the window can be completely processed. The entire operation requires only (m+n-1) clock cycles after all input vectors are applied. It is important to note that the time-complexity of the sorting operation described above has a linear dependence both with respect to window size (m) and with respect to bit-length (n).
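The rank schedule implied by this example can be tabulated with a few lines of code. The sketch below assumes that bit-plane k (k = 0 for the MSB plane) processes, in clock cycle t, the rank that entered the MSB plane k cycles earlier; the function names and the zero-based cycle numbering are choices made for this illustration only.

```python
def plane_schedule(m, n_bits, ranks=None):
    """Rank handled by each bit-plane in each clock cycle.

    Row t lists, for planes 0 (MSB) to n_bits - 1 (LSB), the rank being
    processed in cycle t, or None when the plane is idle.
    """
    if ranks is None:
        ranks = list(range(1, m + 1))    # assumed ascending rank sequence
    schedule = []
    for t in range(m + n_bits - 1):      # total of (m + n - 1) cycles
        row = []
        for k in range(n_bits):
            j = t - k                    # index of the rank that reached plane k
            row.append(ranks[j] if 0 <= j < m else None)
        schedule.append(row)
    return schedule


def completion_cycle(j, n_bits):
    """Cycle (zero-based) in which the LSB of the j-th output is produced."""
    return j + n_bits - 1
```

For m = 5 and n = 4, as in Fig. 3, the schedule has 8 rows: the first row is [1, None, None, None], the second is [2, 1, None, None], and the last is [None, None, None, 5], which is consistent with the (m+n-1) cycle count stated above.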
REALIZATION OF THE SORTING ENGINE
The proposed sorter architecture exploits the fact that the modular ROF core described in Section 2 is capable of generating one output vector per clock cycle, corresponding to the currently selected rank. If the ranking process is repeated on the same set of vectors instead of processing a continuous stream of new vectors, the members of the vector set can be sorted in linear time by simply changing (increasing or decreasing) the rank in each clock cycle.
The overall architecture of the sorting engine is shown in Fig. 4. The flow of data through the modular ROF core is regulated by complementary input and output shift registers, which are used to stagger the individual bit-planes of each input vector to enable bit-level pipelined operation. The multiplexer on the input side is used for accepting the input vectors at the rate of one vector per clock cycle, as well as for circulating (rotating) the data until sorting is completed. The control logic is responsible for regulating the data circulation path, and for applying the rank selection signals to the individual bit-planes, in ascending or descending order. The fact that each individual bit-plane is capable of processing a different rank at any given time significantly increases the overall efficiency of this architecture. In a typical sorting run, the control logic simply requests each bit-plane to process a different rank in each clock cycle, either beginning from the maximum rank and descending, or beginning from the minimum rank and ascending.
The proposed architecture has been described in VHDL to verify its operation. Fig. 5 shows simulated results of two sorting operations on an arbitrarily ordered set of eight vectors, each with a bit-length of 8 bits. It can be seen that the first output vector is generated with a latency of (n-1) clock cycles after the last vector of the set is entered.
Figure 5: Simulation results of a ranking operation on an arbitrarily ordered set of eight vectors.
The input set is sorted in descending order, from the maximum value (233) to the minimum value (11). The input set can also be sorted in ascending order, from minimum to maximum value, simply by changing the rank sequence applied to the majority gates of each bit-plane.
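The role of the complementary input and output shift registers can be illustrated with a small data-skew model. The sketch assumes that the input registers delay bit-plane k by k clock cycles (k = 0 for the MSB plane) and that the output registers remove the same skew so that all bits of a word are realigned; the actual register arrangement of Fig. 4 may differ, and the function names stagger and destagger are hypothetical.

```python
def stagger(words, n_bits):
    """Skew the bit-planes: word i reaches plane k in cycle i + k.

    Returns one row per clock cycle; entry k of a row is the bit entering
    plane k in that cycle, or None if the plane receives no data.
    """
    m = len(words)
    timeline = [[None] * n_bits for _ in range(m + n_bits - 1)]
    for i, w in enumerate(words):
        for k in range(n_bits):          # plane 0 = MSB
            timeline[i + k][k] = (w >> (n_bits - 1 - k)) & 1
    return timeline


def destagger(timeline, n_bits):
    """Undo the skew, as the complementary output registers would."""
    m = len(timeline) - n_bits + 1
    words = []
    for i in range(m):
        w = 0
        for k in range(n_bits):          # collect bit k from cycle i + k
            w = (w << 1) | timeline[i + k][k]
        words.append(w)
    return words
```

Applying destagger to the result of stagger returns the original word sequence. Under this assumed skew, the last bit of a word leaves the LSB plane (n-1) cycles after its MSB entered the MSB plane, which matches the latency quoted above.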
