Scalable Binary Sorting Architecture Based on Rank Ordering With Linear Time Complexity by Hatirnaz, I. & Leblebici, Y.
Scalable Binary Sorting Architecture Based on Rank Ordering
With Linear Area-Time Complexity
. Hatrnaz and Y. Leblebici
Department of Electrical and Computer Engineering
Worcester Polytechnic Institute
Abstract
A new modular architecture is presented for the realization
of high-speed binary sorting engines, based on efficient
rank ordering. Capacitive Threshold Logic (CTL)
gates are utilized for the implementation of the multi-input
programmable majority (voting) functions required in the
architecture. The overall complexity of the proposed bit-
serial architecture increases linearly with the number of
input vectors to be sorted (window size = m) and with
the bit-length of the input vectors (word size = n), and
the sorter architecture can be easily expanded to accom-
modate large vector sets. It is demonstrated that the pro-
posed sorting engine is capable of producing a fully sorted
output vector set in (m+n-1) clock cycles, i.e., in linear
time.
1 Introduction
The task of sorting an arbitrarily ordered vector set ac-
cording to magnitude (either from-largest-to-smallest or
from-smallest-to-largest) is one of the fundamental oper-
ations required in many digital signal processing appli-
cations. It is also among the best studied problems in
computer science, with a variety of different algorithms
developed for this purpose. Many fundamental computer
science problems, such as searching, finding the closest pair,
and computing frequency distributions, become easy to solve
once a set of items is sorted (Fig. 1).
Sorting is an expensive operation in terms of area-time
complexity; software-based solutions require word-level
sorting and can become computationally intensive, while
the overall complexity of hardware-based solutions usually
increases very rapidly with the size of the input vector set
(number of vectors) and with the bit-length of the input
vectors [1], [3], [5]. The design of efficient sorting engine
architectures is therefore a significant challenge for
overcoming the computational bottleneck of the binary sorting
problem. A number of recent proposals for the realization
of sorting networks rely primarily on median or rank order
filters (ROF), yet their capabilities in terms of window size
and bit-length are typically limited due to rapidly
increasing hardware complexity [2], [5], [6].
Figure 1: Illustration of the sorting process (1-D). A window
containing m input vectors (V1 through V9 in the figure) is
sorted in ascending order (smallest vector first) or in
descending order (largest vector first).
In this paper, we present a new bit-serial sorting
architecture based on rank-ordering. The hardware realization
of this architecture results in a compact and fully modular
sorting engine that is capable of processing a large number
of input vectors in linear time. The overall architecture is
completely scalable to accommodate a wide range of window
sizes and bit-lengths, and the hardware complexity grows
only linearly with both of these parameters. The proposed
sorter architecture is essentially based on a fully
programmable modular ROF design that was presented
earlier [7]. In the following, the rank ordering
architecture is presented in Section 2. The proposed sorting
algorithm and its hardware realization are examined
in Sections 3 and 4, followed by a summary of results.
Figure 2: The ROF core as proposed in the modular rank
ordering architecture, the gate-level structure of a ROF cell,
and the corresponding layout, allowing modular expansion.
Signals labeled in the figure include data(0) through
data(m-1), coreOut(0) through coreOut(m-1), and the Rank Bus.
2 The Rank Ordering Architecture
A bit-serial algorithm proposed in [5] was chosen as the
basis of the programmable rank-order filter architecture
implemented in this work. In this algorithm, the problem
of finding a rank-order-selection for n-bit long words is
reduced to finding n rank-order-selections for 1-bit
numbers.
The algorithm starts by processing the most significant
bits (MSB) of the m = (2N + 1) words in the current
window, through an m-input programmable majority gate,
to yield the MSB of the desired filter output. This
output is then compared with the other MSBs of the window
elements. The vectors whose MSB is not equal to the filter
output have their MSB propagated down by one position,
replacing the less significant bits of the corresponding
words.
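The bit-serial selection described above can be sketched in software as follows. This is a minimal functional model, not the circuit of [5] or [7]: the function name rank_select, the convention that rank 1 denotes the smallest word, and the threshold formula (m - rank + 1 ones are needed for an output bit of 1) are assumptions introduced for illustration.

```python
def rank_select(words, n, rank):
    """Return the word of the given rank among n-bit words.

    Assumed convention: rank 1 is the smallest word, rank len(words) the largest.
    """
    m = len(words)
    threshold = m - rank + 1          # ones needed for the output bit to be 1
    forced = [None] * m               # propagated bit per word, once it has differed
    result = 0
    for pos in range(n - 1, -1, -1):  # process bit-planes MSB first
        # A word that already differed contributes its propagated bit instead.
        bits = [f if f is not None else (w >> pos) & 1
                for w, f in zip(words, forced)]
        out = 1 if sum(bits) >= threshold else 0   # programmable majority decision
        for i, b in enumerate(bits):
            if forced[i] is None and b != out:
                forced[i] = b         # this bit replaces all less significant bits
        result = (result << 1) | out
    return result


window = [9, 2, 14, 7, 11]            # five 4-bit words
print(rank_select(window, 4, 3))      # median of the window
```

With rank set to (m + 1) / 2 the same routine acts as a median filter; other rank values select other order statistics without any change to the data path.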
The bit-serial operation flow of the algorithm described
above suggests a simple bit-level pipelined data path
architecture, consisting of data modifier-propagator blocks
(ROF cells) to handle fine-grained data selection, and
majority decision blocks (majority gates) to determine output
bits. The modular architecture consisting of these two
major blocks enables fully scalable construction of filter
structures of arbitrary window size and bit-length (Figure
2). The bit-length dictates the number of the majority
decision gates, whereas the window size determines the
number of ROF cells driving one of these majority gates.
This regular structure also forms the basis of the sorting
algorithm described in the next section.
The ROF core shown in Figure 2 has (n x m) ROF cells
where m = (2N + 1) is the window size and n is the bit-
length of the input words (vectors). Thus, it can be seen
that the overall circuit complexity increases linearly with
maximum window size (m) and with bit-length (n).
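As a quick check of this linear growth, the block counts of the core can be tabulated. The helper name rof_core_resources is hypothetical; the counts follow directly from the text (m ROF cells and one majority gate per bit-plane).

```python
def rof_core_resources(m, n):
    """Count the building blocks of an ROF core with window size m and bit-length n."""
    return {
        "rof_cells": n * m,      # m cells in each of the n bit-planes
        "majority_gates": n,     # one m-input majority gate per bit-plane
    }


print(rof_core_resources(63, 16))
print(rof_core_resources(126, 16))   # doubling m doubles the cell count
```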
A prototype ROF circuit has been designed and fabricated
using a 0.8 um CMOS process, to validate the main
operation principles of the ROF architecture. Detailed
measurement results of this circuit indicate that the new
architecture can accommodate sampling clock rates up to
50 MHz [7]. In this circuit, the programmable majority
decision gates are realized using the capacitive threshold
logic (CTL) circuit architecture presented earlier [4]. This
allows simple implementation of programmable majority
gates with up to 63 parallel inputs, using a very small
silicon area (625 um x 130 um for a 63-bit majority gate).
In comparison, a classical realization of the 63-bit
majority gate would require an equivalent of 63 6-bit
full-adder circuits, arranged in a network with a logic depth
of 64 (synthesized from an HDL description).
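For reference, the logic function of a programmable majority gate can be written as a count-and-compare operation. This is only a software analogue of the classical adder-based realization, not a model of the CTL circuit; the function name programmable_majority and the "at least threshold ones" convention are assumptions.

```python
def programmable_majority(bits, threshold):
    """Return 1 when at least `threshold` of the input bits are 1, else 0."""
    ones = sum(bits)                 # population count of the inputs
    return 1 if ones >= threshold else 0


inputs = [1] * 32 + [0] * 31         # 63 inputs, 32 of them set
print(programmable_majority(inputs, 32))   # plain majority of 63 inputs
print((63).bit_length())                   # bits needed to hold a count up to 63
```

The count of a 63-input gate ranges from 0 to 63 and therefore fits in 6 bits, which is consistent with the 6-bit adder width quoted above.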
3 The Sorting Algorithm
In the following, we propose a bit-serial sorting algorithm
with an input set (window) of m n-bit words. The output
corresponds to a sequence of the input vectors in a desired
rank order. The algorithm starts by processing the MSBs
of the m input vectors in the current window. Note that
each bit-plane can be independently assigned its own rank
value which is used to calculate the slice output.
The pseudo-code of the proposed sorting algorithm is
given below. The algorithm involves two loops; the outer
loop initializes the rank value for the next iteration and
checks if the sorting operation is finished, whereas the
inner loop does the actual sorting operation by performing
parallel instructions on n bit-planes.
rankof(1) := firstDesiredRank;
do {
    -- Rank initialization
    if (rankof(1) != lastDesiredRank)
        rankof(0) := nextDesiredRank;
    -- Main operation starts
    for all bit-planes
    do {
        if (all_bits(selectedBitplane) are in the core)
            then shift_rotate(selectedBitplane);
        else
            shift(selectedBitplane);
        end if;
        selectedRank := rankof(selectedBitplane);
        outputWordVector(selectedRank, selectedBitplane) :=
            rank_order(selectedBitplane, selectedRank);
        rankof(selectedBitplane) :=
            rankof(selectedBitplane - 1);
    }
} while (rankof(n) != lastDesiredRank)
Listing of the proposed sorting algorithm.
The very first step of the algorithm is to set the rank
value of the most-significant bit-plane to the first desired
rank (firstDesiredRank), whose value depends on whether
the input vectors are to be sorted in ascending or
descending order. For example, when the input vectors are
sorted in ascending order, the rank value corresponding to
the most-significant bit-plane (rankof(1)) has to be set to
smallestRank at the first iteration of the main operation
loop (inner loop), which results in filtering out the
smallest input word. The rank values for the next iterations
(nextDesiredRank) are also determined by the sorting
direction, and are stored in a register (rankof(0)).
If the vectors are to be sorted in ascending order, the value
of rankof(0) is increased until the rank value corresponding
to the MSB plane becomes equal to the upper rank value
(lastDesiredRank), which is largestRank. So,
at each step, the value of rankof(0) is assigned to the
MSB slice (rankof(1)), whereas the rank value of each
slice is shifted to the next less significant bit-slice. It
should also be noted that the algorithm can be used for
sorting the input vectors in any desired order. In this case,
a look-up table may be used to provide the necessary
sequence of rank values to the sorter engine core.
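The rank handling described above can be sketched as a rank schedule that is shifted down the bit-planes by one position per clock cycle. The names rank_sequence and rank_pipeline, the convention that rank 1 is the smallest word, and the use of None for an idle bit-plane are assumptions made for this illustration.

```python
def rank_sequence(m, order="ascending", lookup=None):
    """Return the sequence of ranks fed to the MSB plane, one per clock cycle.

    Assumed convention: rank 1 is the smallest word, rank m the largest.
    A lookup table, if given, supplies an arbitrary output order.
    """
    if lookup is not None:
        return list(lookup)
    ranks = list(range(1, m + 1))
    return ranks if order == "ascending" else ranks[::-1]


def rank_pipeline(sequence, n):
    """Yield, per clock cycle, the rank held by each bit-plane (None = idle)."""
    planes = [None] * n                       # planes[0] models rankof(1), the MSB plane
    for nxt in list(sequence) + [None] * (n - 1):
        planes = [nxt] + planes[:-1]          # rankof(k) := rankof(k - 1)
        yield list(planes)


for cycle, planes in enumerate(rank_pipeline(rank_sequence(5), 4), start=1):
    print(cycle, planes)
```

For five ranks and four bit-planes the schedule occupies eight cycles, the last of which applies the final rank to the least significant bit-plane.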
Figure 3: Illustration of a sorting operation on five 4-bit
input vectors: (a) The staggered input vectors are shifted
into the ROF core, and the first rank (R1) is applied to the
MSB plane. (b) The MSB of the first input vector (A1)
is rotated, R1 is applied to the next bit-plane, and the
new rank R2 is applied to the MSB plane. (c) B1 and A2
are rotated, while R1 is applied to the next less significant
bit-plane. The rank R2 shifts down by one, while R3 is applied
to the MSB plane. (d) Bit circulation continues, while the
ranks propagate down the bit-planes in descending order.
In this example, the rank ordering of the input vectors
is assumed to be: C (largest vector)-A-D-B-E (smallest
vector). The columns of each panel are labeled Inputs,
Sorter-Core, Rank, and Outputs.
The operations contained in the inner loop are performed
at the same time on all bit-planes. After the m
bits in each bit-plane are arranged either by shifting or by
shifting and rotating, the corresponding bit-plane output
is calculated by evaluating all of the bits in each bit-plane
according to the current rank value (rank_order). The
algorithm is finished after the bit-plane corresponding to
the least-significant bits is processed with the last rank
value (lastDesiredRank).
Figure 4: The signal flow of one sorter bit-slice, showing
how the data bits are circulated. The blocks labeled in the
figure are the input MUX, the chain of ROF cells, the majority
gate, and the rank register fed from the rank bus of the
control block.
The operation of the proposed sorting algorithm is
illustrated with an example in Fig. 3. Here, five 4-bit vectors
(A through E) are being sorted by the ROF core. The first
rank (R1) is initially applied to the MSB plane consisting of
the bits A1 through E1. In the next clock cycle, the same
rank is used to process the next less significant bit-plane (A2
through E2), while a new rank (R2) is being applied to
the MSB plane. Also note that the staggered data bits are
gradually circulated from the end of the chain to the front,
so that each vector in the window can be completely
processed. The entire operation requires only (m+n-1) clock
cycles after all input vectors are applied. It is important
to note that the time-complexity of the sorting operation
described above has a linear dependence both with respect
to window size (m) and with respect to bit-length (n).
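The (m+n-1) cycle count can be reproduced with a small functional model of the rank pipeline. This is a sketch under stated assumptions, not the hardware data path: the function name bit_serial_sort is hypothetical, the circulation of bits through the ROF cells is replaced by a per-rank table of propagated bits, and the cycle count covers only the processing after all input vectors have been applied.

```python
def bit_serial_sort(words, n, descending=False):
    """Functional model of the rank-pipelined sorter; returns (sorted_words, cycles)."""
    m = len(words)
    # Number of ones that makes a plane output 1, one entry per requested rank.
    # Ascending order starts with the smallest word, which needs all m ones.
    thresholds = list(range(1, m + 1)) if descending else list(range(m, 0, -1))
    forced = [[None] * m for _ in range(m)]   # propagated bits, per rank in flight
    outputs = [0] * m
    cycles = 0
    for t in range(m + n - 1):                # one iteration per clock cycle
        for plane in range(n):                # all bit-planes operate in parallel
            j = t - plane                     # index of the rank reaching this plane
            if not 0 <= j < m:
                continue                      # plane idle while the pipeline fills or drains
            pos = n - 1 - plane               # plane 0 handles the MSB
            bits = [f if f is not None else (w >> pos) & 1
                    for w, f in zip(words, forced[j])]
            out = 1 if sum(bits) >= thresholds[j] else 0
            for i, b in enumerate(bits):
                if forced[j][i] is None and b != out:
                    forced[j][i] = b          # propagate the differing bit downwards
            outputs[j] = (outputs[j] << 1) | out
        cycles += 1
    return outputs, cycles


print(bit_serial_sort([9, 2, 14, 7, 11], 4))
print(bit_serial_sort([9, 2, 14, 7, 11], 4, descending=True))
```

For the five 4-bit words used here, the model finishes in 5 + 4 - 1 = 8 cycles in either sorting direction.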
4 Realization of the Sorting Engine
The proposed sorter architecture exploits the fact that the
modular ROF core described in Section 2 is capable of
generating one output vector per clock cycle, corresponding
to the currently selected rank. If the ranking process is
repeated on the same set of vectors instead of processing
a continuous stream of new vectors, the members of the
vector set can be sorted in linear time by simply changing
(increasing or decreasing) the rank in each clock cycle.
Figure 4 shows the circuit structure and the signal flow
of one sorter bit-slice that is designed to implement the
bit-level operations described above. The multiplexer on
the input side is used for accepting the input vectors at
the rate of one vector per clock cycle, as well as for
circulating (rotating) the data until sorting is completed. The
so-called "sorter core" is simply constructed by stacking
n such bit-slices, as depicted in Fig. 5.
The overall architecture of the sorting engine is shown
in Fig. 5. The flow of data through the modular ROF
core is regulated by complementary input and output
shift registers, which are used to stagger the individual
bit-planes of each input vector to enable bit-level pipelined
operation. The control logic is responsible for regulating
the data circulation path, and for applying the rank
selection signals to the individual bit-planes, in ascending or
descending order. The fact that each individual bit-plane
is capable of processing a different rank at any given time
significantly increases the overall efficiency of this
architecture. In a typical sorting run, the control logic simply
requests each bit-plane to process a different rank in each
clock cycle, either beginning from the maximum rank and
descending, or beginning from the minimum rank and
ascending.
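The role of the complementary shift register arrays can be illustrated with a pair of skew and de-skew routines. The function names stagger and unstagger are hypothetical, and the skew direction (the MSB plane arrives first and each lower bit-plane is delayed by one further clock cycle) is an assumption inferred from the rank propagation shown in Fig. 3.

```python
def stagger(words, n):
    """Skew the bit-planes: plane k (0 = MSB) is delayed by k clock cycles."""
    m = len(words)
    stream = []
    for t in range(m + n - 1):
        row = []
        for k in range(n):
            j = t - k                         # index of the word whose bit arrives now
            bit = (words[j] >> (n - 1 - k)) & 1 if 0 <= j < m else None
            row.append(bit)                   # None marks an empty register slot
        stream.append(row)
    return stream


def unstagger(stream, m, n):
    """Inverse of stagger: realign the skewed bit-planes into whole words."""
    words = [0] * m
    for t, row in enumerate(stream):
        for k, bit in enumerate(row):
            if bit is not None:
                words[t - k] |= bit << (n - 1 - k)
    return words


skewed = stagger([9, 2, 14, 7, 11], 4)
print(skewed[0], skewed[1])
print(unstagger(skewed, 5, 4))
```

Applying unstagger to the output of stagger returns the original words, which mirrors the way the output array converts the staggered result vectors back to normal form.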
Figure 5: Top-level blocks in the architecture of the
proposed sorter engine: the input shift register array (used to
obtain bit-wise staggered data vectors), the sorter core with
its slices from MSB to LSB, the output shift register array
(used to convert bit-wise staggered output vectors back to
normal), and the control block. Note that each slice in the
sorter core contains m ROF cells, one data MUX and one
m-input majority gate.
The proposed architecture has been described in
VHDL to verify its operation. Figures 6 and 7 show
simulated results of the sorting operation on an arbitrarily
ordered set of 15 vectors (m=15), each with a bit-length
of 8 bits (n=8). The user determines how many input
vectors are to be sorted (actualWindowSize, not shown
in Fig. 6 and Fig. 7) and in which direction the sorting
will occur (sortType) and provides these inputs to the
sorter block together with a request pulse (sortRequest).
As soon as the request comes, the sorter block produces
a signal (sortActive) which stays at the logic high level as
long as the corresponding set of vectors is processed. It
can be seen that the first output vector is generated with
a latency of (n-1) clock cycles, after the last vector of the
set is entered. The sorter block provides a signal to the
user (outputsValid) which goes high right at the last
rising edge of the clock before the first vector is ready at the
output (sortDataOutput).
Figure 6: VHDL simulation results of the proposed sorter architecture (m=15, n=8, ascending order).
Figure 7: VHDL simulation results of the proposed sorter architecture (m=15, n=8, descending order).
The mask layout of a (63x16-bit) sorter block was
completed using 0.8 um CMOS technology, to evaluate the
area-efficiency of the presented architecture. The entire
sorting engine occupies a silicon area of 37.7 mm^2, about
80% of which is dedicated to the ROF cells, and 10% of
which is dedicated to majority gates.
5 Summary
In this paper, we present a highly modular architecture
for the realization of high-speed binary sorting engines.
The architecture consists of (i) a regular "core" array that
is completely scalable to accommodate large window sizes
and bit-lengths, (ii) input/output shift registers, and (iii)
control logic to regulate the bit-level processing of data. It
was shown that the complexity of the proposed bit-serial
pipelined architecture increases linearly with the number
of input vectors (m) to be sorted, and with bit-length of
the input vectors (n). It was also demonstrated that the
proposed sorting engine is capable of producing a fully
sorted output vector set in (m+n-1) clock cycles, i.e., in
linear time.
References
[1] D.S. Richards, VLSI median lters, IEEE Trans. Acoust.,
Speech, Signal Processing , vol. 38, pp.145-152, Jan 1990.
[2] W.K. Lam and C.K. Li, Binary sorter by majority gate,
IEE Electronic Letters, Vol. 32, July 1996.
[3] P. Wendt et al., Stack lters, IEEE Trans. Acoust.,
Speech, Signal Processing , pp. 898-911, 1986.
[4] Y. Leblebici, F.K. Gurkaynak, D. Mlynek, A compact
31-input programmable majority gate based on capacitive
threshold logic, in Proc. ISCAS 1998.
[5] B.K. Kar, D.K. Pradhan, A new algorithm for order statis-
tic and sorting, IEEE Trans. on Signal Processing , vol. 41,
pp.2688-2694, August 1993.
[6] C.C. Lin, C.J. Kuo, Fast response 2-D rank order algo-
rithm by using max-min sorting network, Int. Conf. on
Image Processing 1996 , Vol. 1, pp. 403-406.
[7] . Hatrnaz, F.K. Gurkaynak, Y. Leblebici, A modular and
scalable architecture for the realization of high-speed pro-
grammable rank-order lters, ASIC'99 Proceedings, pp.
382-386, 1999.
