A Compact Modular Architecture for High-Speed Binary Sorting by Hatirnaz, I. et al.
A COMPACT MODULAR ARCHITECTURE FOR
HIGH-SPEED BINARY SORTING
İ. Hatırnaz, F. K. Gürkaynak, Y. Leblebici
Worcester Polytechnic Institute
Department of Electrical and Computer Engineering
Worcester, MA 01609-2280
ABSTRACT
A new algorithm and a new modular architecture are presented
for the realization of high-speed binary sorting engines,
based on efficient rank ordering. Capacitive Threshold Logic
(CTL) gates are utilized for the implementation of the
multi-input programmable majority (voting) functions required
in the architecture. The overall complexity of the proposed
bit-serial architecture increases linearly with the number of
input vectors to be sorted (window size = m) and with the
bit-length of the input vectors (word size = n), and the
sorter architecture can be easily expanded to accommodate
large vector sets. Detailed simulations indicate that the
sorter structure can operate at sampling clock rates of up to
50 MHz, where the throughput is boosted by fine-grain
pipelining. It is demonstrated that the proposed sorting
engine is capable of producing a fully sorted output vector
set in (m+n-1) clock cycles.
1. INTRODUCTION
The task of sorting an arbitrarily ordered vector set
according to magnitude (either from largest to smallest or
from smallest to largest) is one of the fundamental
operations required in many digital signal processing
applications. It is also an expensive operation in terms of
area-time complexity; software-based solutions require
word-level sorting and can become computationally intensive,
while the overall complexity of hardware-based solutions
usually increases very rapidly with the size of the input
vector set (number of vectors) and with the bit-length of the
input vectors [5], [1], [3]. The design of efficient sorting
engine architectures is therefore a significant challenge for
overcoming the computational bottleneck of the binary sorting
problem. A number of recent proposals for the realization of
sorting networks rely primarily on median or rank order
filters (ROF), yet their capabilities in terms of window size
and bit-length are typically limited due to rapidly
increasing hardware complexity [5], [6], [2].
In this work, we present a new bit-serial sorting algorithm
based on rank-ordering. The hardware realization of this
algorithm results in a compact and fully modular sorting
engine architecture that is capable of processing a large
number of input vectors in linear time (Fig. 1). The overall
architecture is completely scalable to accommodate a wide
range of window sizes and bit-lengths, and the hardware
complexity only grows linearly with both of these parameters.
The proposed sorter architecture is essentially based on a
fully programmable modular ROF design that was presented
earlier [9]. In the following, we first present the rank
ordering architecture in Section 2, followed by the proposed
sorting algorithm and its hardware realization.
[Figure: input vectors (words) V1 to V9 pass over time through
a sliding window containing m vectors. Rank ordering of the
current window elements yields the i-th ranked vector in the
window, for example the largest vector (maximum filter), the
median vector (median filter), or the smallest vector
(minimum filter).]
Figure 1: Illustration of the rank-ordering process.
2. THE RANK ORDERING
ARCHITECTURE
A bit-serial algorithm first proposed in [5] was chosen as
the basis of the programmable rank-order filter architecture
implemented in this work. In this algorithm, the problem of
finding a rank-order-selection for n-bit long words is
reduced to finding n rank-order-selections for 1-bit numbers.
The algorithm starts by processing the most significant bits
(MSB) of the m=(2N + 1) words in the current window, through
an m-input programmable majority gate, to yield the MSB of
the desired filter output. This output is then compared with
the other MSBs of the window elements. The vectors whose MSB
is not equal to the filter output have their MSB propagated
down by one position, replacing the less significant bits of
the corresponding words.
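
The bit-serial selection described above can be illustrated
with a short software model. The following Python sketch is
not the authors' implementation: the function name
rank_select and the rank convention (rank 1 selects the
smallest word and rank m the largest, so the majority
threshold becomes m - rank + 1) are assumptions made for
illustration only.

```python
def rank_select(words, n_bits, rank):
    """Select the word of the given rank, one bit-plane at a time.

    Assumed convention: rank 1 is the smallest word, rank m the largest.
    """
    m = len(words)
    threshold = m - rank + 1  # number of 1s needed for an output of 1
    # Unpack every word into a list of bits, MSB first.
    bits = [[(w >> (n_bits - 1 - b)) & 1 for b in range(n_bits)]
            for w in words]
    result = 0
    for b in range(n_bits):
        ones = sum(row[b] for row in bits)
        out = 1 if ones >= threshold else 0  # programmable majority decision
        result = (result << 1) | out
        for row in bits:
            if row[b] != out:
                # This word can no longer be the answer: its current bit
                # replaces all of its less significant bits.
                for k in range(b + 1, n_bits):
                    row[k] = row[b]
    return result


window = [5, 12, 3, 9, 7]
median = rank_select(window, n_bits=4, rank=3)
```

For this five-word window, rank 3 corresponds to the median,
so the call above returns 7.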
The bit-serial operation flow of the algorithm described
above suggests a simple bit-level pipelined data path
architecture, consisting of data modifier-propagator blocks
to handle fine-grained data selection, and majority decision
blocks to determine output bits.
A programmable rank-order filter of any window size and
bit-length can be realized by using the two main blocks
described above. The bit-length dictates the number of the
majority decision gates, whereas the window size determines
the number of ROF-cells driving one of these majority gates.
This regular structure results in the ROF core array shown in
Fig. 2. The programmable majority decision gates are realized
using the capacitive threshold logic (CTL) circuit
architecture presented earlier [4]. This allows simple
implementation of programmable majority gates with up to 63
parallel inputs, using a very small silicon area
(625 µm x 130 µm for a 63-bit majority gate).
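
As a purely behavioural illustration of what "programmable
majority" means here, the sketch below models one gate as a
threshold function over the m bits of a single bit-plane. It
says nothing about the CTL circuit itself; the function name
programmable_majority and the integer threshold parameter are
assumptions introduced for this example.

```python
def programmable_majority(bits, threshold):
    """Return 1 when at least `threshold` of the input bits are 1."""
    return 1 if sum(bits) >= threshold else 0


plane = [1, 0, 1, 1, 0]  # one bit from each of m = 5 words
m = len(plane)

any_one = programmable_majority(plane, 1)                # OR-like setting
plain_majority = programmable_majority(plane, (m + 1) // 2)  # classic vote
all_ones = programmable_majority(plane, m)               # AND-like setting
```

Changing only the threshold turns the same gate into the
decision element of a maximum, median, or minimum filter,
which is the property the ROF core relies on.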
[Figure: regular array of ROF cells, each row driving one
majority decision gate. Inputs data(0) to data(m-1), outputs
coreOut(0) to coreOut(m-1), and a shared rank bus.]
Figure 2: The ROF core as proposed in the ROF architecture.
The signal flow between the ROF cells and the majority gates
is also shown in Figure 2. The modular architecture
consisting of only two major blocks enables fully scalable
construction of filter structures of arbitrary size. It also
forms the basis of the sorting algorithm described in the
next section.
3. THE SORTING ALGORITHM
The proposed sorting algorithm is a bit-serial algorithm,
whose input is a window of m n-bit words. The out-
put corresponds to a sequence of the input vectors in a
desired rank order. It starts by processing the MSBs of
the m input vectors in the current window. Each bit-
plane has its own rank value which is used to calculate
the corresponding slice output.
The pseudo-code of the proposed sorting algorithm is given
below. The algorithm involves two loops: the outer loop
initializes the rank value for the next iteration and checks
whether the sorting operation is finished, whereas the inner
loop does the actual sorting operation by performing parallel
instructions on n bit-planes.
    rankof(1) := firstDesiredRank;
    do {
        -- Rank initialization
        if (rankof(1) != lastDesiredRank)
            rankof(0) := nextDesiredRank;
        -- Main operation starts
        for all bit-planes
        do {
            if (all_the_bits(selectedBitplane) are in the core) then
                shift_rotate(selectedBitplane);
            else
                shift(selectedBitplane);
            end if;
            selectedRank := rankof(selectedBitplane);
            outputWordVector(selectedRank, selectedBitplane) :=
                rank_order(selectedBitplane, selectedRank);
            rankof(selectedBitplane) :=
                rankof(selectedBitplane - 1);
        }
    } while (rankof(n) != lastDesiredRank)
Listing of the proposed sorting algorithm.
The very first step of the algorithm is to set the rank value
of the most-significant bit-plane to the first desired rank
(firstDesiredRank), whose value depends on whether the input
vectors are to be sorted in ascending or descending order.
For example, when sorting the input vectors in ascending
order, at the first iteration of the main operation loop
(inner loop) the rank value corresponding to the
most-significant bit-plane (rankof(1)) has to be set to
smallestRank, which selects the smallest input word as the
first output. The rank values for the next iterations
(nextDesiredRank) are likewise determined by the sorting
direction. If sorting is in ascending order, rankof(0) is
assigned rankof(1) + 1 until the rank value corresponding to
the MSB-plane equals the upper rank value (lastDesiredRank),
which is largestRank. It should also be noted that the
algorithm can be used for sorting the input vectors in any
desired order. In this case, a look-up table may be used to
provide the necessary sequence of rank values.
The operations contained in the inner loop are performed at
the same time on all bit-planes. After the m bits in each
bit-plane are arranged, either by shifting or by shifting and
rotating, the corresponding bit-plane output is calculated by
evaluating all of the bits in that bit-plane according to its
current rank value (rank_order). The algorithm is finished
after the bit-plane corresponding to the least-significant
bits is processed with the last rank value (lastDesiredRank).
The operation of the proposed sorting algorithm is
illustrated with an example in Fig. 3. Here, five 4-bit
vectors (A through E) are being sorted by the ROF core. The
first rank (R1) is initially applied to the MSB plane
consisting of the bits A1 through E1. In the next clock
cycle, the same rank is used to process the next
less-significant bit-plane (A2 through E2), while a new rank
(R2) is applied to the MSB plane. Also note that the
staggered data bits are gradually circulated from the end of
the chain to the front, so that each vector in the window can
be completely processed. The entire operation requires only
(m+n-1) clock cycles after all input vectors are applied. It
is important to note that the time-complexity of the sorting
operation described above has a linear dependence both with
respect to window size (m) and with respect to bit-length (n).
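
The way successive ranks follow each other down the
bit-planes can be sketched as a small functional model. The
Python code below is an illustration under stated
assumptions, not the hardware described in this paper: it
does not model the shift and rotate registers, the name
pipelined_sort and the per-job "forced" table are invented
for the example, rank 1 is taken to select the smallest word,
and the cycle count m+n-1 is built into the loop bound rather
than derived.

```python
def pipelined_sort(words, n_bits, ranks):
    """Functional model: plane b handles rank job (t - b) in cycle t."""
    m = len(words)
    # forced[k][i] is None while word i is still a candidate for rank
    # job k; otherwise it holds the bit the word shows to lower planes.
    forced = [[None] * m for _ in ranks]
    outputs = [0] * len(ranks)
    total_cycles = len(ranks) + n_bits - 1
    for t in range(total_cycles):
        for b in range(n_bits):        # all bit-planes act in one cycle
            k = t - b                  # rank job currently in plane b
            if not 0 <= k < len(ranks):
                continue               # plane idle while pipe fills/drains
            shift = n_bits - 1 - b     # plane 0 is the MSB plane
            seen = [((w >> shift) & 1) if f is None else f
                    for w, f in zip(words, forced[k])]
            threshold = m - ranks[k] + 1
            out = 1 if sum(seen) >= threshold else 0
            outputs[k] = (outputs[k] << 1) | out
            for i, bit in enumerate(seen):
                if forced[k][i] is None and bit != out:
                    forced[k][i] = bit  # propagate the mismatching bit
    return outputs, total_cycles


vectors = [5, 12, 3, 9, 7]             # five 4-bit words, as in Fig. 3
ascending, cycles = pipelined_sort(vectors, 4, [1, 2, 3, 4, 5])
descending, _ = pipelined_sort(vectors, 4, [5, 4, 3, 2, 1])
```

With five 4-bit words the model runs for 5 + 4 - 1 = 8 cycles
and returns [3, 5, 7, 9, 12] for the ascending rank sequence.
Passing any other list of ranks plays the role of the look-up
table mentioned in the text.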
[Figure: four panels, (a) to (d), showing the staggered bits
A1 to E4 of five 4-bit vectors circulating through the ROF
core while the ranks R1 to R5 are applied to the bit-planes.]
Figure 3: Illustration of a sorting operation on five 4-bit
input vectors: (a) The staggered input vectors are shifted
into the ROF core, and the first rank (R1) is applied to the
MSB plane. (b) The MSB of the first input vector (A1) is
rotated, R1 is applied to the next bit-plane, and the new
rank R2 is applied to the MSB plane. (c) B1 and A2 are
rotated, while R1 is applied to the next less-significant
bit-plane. The rank R2 shifts down by one, while R3 is
applied to the MSB plane. (d) Bit circulation continues,
while the ranks propagate down the bit-planes in descending
order.
4. REALIZATION OF THE SORTING
ENGINE
The proposed sorter architecture exploits the fact that the
modular ROF core described in Section 2 is capable of
generating one output vector per clock cycle, corresponding
to the currently selected rank. If the ranking process is
repeated on the same set of vectors instead of processing a
continuous stream of new vectors, the members of the vector
set can be sorted in linear time by simply changing
(increasing or decreasing) the rank in each clock cycle. The
overall architecture of the sorting engine is shown in
Fig. 4. The flow of data through the modular ROF core is
regulated by complementary input and output shift registers,
which are used to stagger the individual bit-planes of each
input vector to enable bit-level pipelined operation. The
multiplexer on the input side is used for accepting the input
vectors at the rate of one vector per clock cycle, as well as
for circulating (rotating) the data until sorting is
completed. The control logic is responsible for regulating
the data circulation path, and for applying the rank
selection signals to the individual bit-planes, in ascending
or descending order. The fact that each individual bit-plane
is capable of processing a different rank at any given time
significantly increases the overall efficiency of this
architecture. In a typical sorting run, the control logic
simply requests each bit-plane to process a different rank in
each clock cycle, either beginning from the maximum rank and
descending, or beginning from the minimum rank and ascending.
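
The role of the complementary input and output shift
registers can be pictured with a small software analogy. The
sketch below assumes that bit-plane b is delayed by b clock
cycles on the input side and realigned on the output side;
the exact register depths of the actual design are not given
here, and the names skew and deskew are hypothetical.

```python
def skew(stream, n_bits):
    """Stagger a word stream: plane b sees word k in cycle k + b.

    Returns one frame per clock cycle; each frame lists the bit in
    every plane (MSB plane first), or None where a plane is empty.
    """
    frames = []
    for t in range(len(stream) + n_bits - 1):
        frame = []
        for b in range(n_bits):
            k = t - b
            if 0 <= k < len(stream):
                frame.append((stream[k] >> (n_bits - 1 - b)) & 1)
            else:
                frame.append(None)
        frames.append(frame)
    return frames


def deskew(frames, n_bits):
    """Realign staggered frames back into whole words."""
    words = []
    for k in range(len(frames) - n_bits + 1):
        w = 0
        for b in range(n_bits):
            w = (w << 1) | frames[k + b][b]
        words.append(w)
    return words


frames = skew([5, 12, 3], n_bits=4)
restored = deskew(frames, n_bits=4)
```

Three 4-bit words occupy 3 + 4 - 1 = 6 staggered frames, and
deskewing them returns the original words in order.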
[Figure: data input and data circulation path feed a data
MUX, followed by the input shift register array, the array
of ROF cells with one majority gate per bit-plane, and the
output shift register array producing the sorter output. A
control block drives the rank control bus.]
Figure 4: Overall architecture of the proposed sorter engine.
The proposed architecture has been described in VHDL to
verify its operation. Fig. 5 shows simulated results of two
sorting operations on an arbitrarily ordered set of eight
vectors, each with a bit-length of 8 bits. It can be seen
that the first output vector is generated with a latency of
(n-1) clock cycles after the last vector of the set is
entered.
Figure 5: Simulation results of a ranking operation on
an arbitrarily ordered set of eight vectors. The input
set is being sorted in descending order from maximum
(233) to minimum (11) value. The input set can also be
sorted in ascending order from minimum to maximum
value, simply by changing the rank sequence applied to
the majority gates of each bit-plane.
5. CONCLUSION
A modular architecture has been presented for the realization
of high-speed binary sorting engines, based on an efficient
rank ordering scheme. The overall complexity of the proposed
bit-serial architecture increases linearly with the number of
input vectors to be sorted and with the bit-length of the
input vectors. It was demonstrated that the proposed sorting
engine is capable of producing a fully sorted output vector
set in (m+n-1) clock cycles.
6. REFERENCES
[1] D.S. Richards, "VLSI median filters," IEEE Trans. Acoust.,
Speech, Signal Processing, vol. 38, pp. 145-152, January 1990.
[2] W.K. Lam and C.K. Li, "Binary sorter by majority gate,"
IEE Electronic Letters, vol. 32, July 1996.
[3] P. Wendt et al., "Stack filters," IEEE Trans. Acoust.,
Speech, Signal Processing, pp. 898-911, 1986.
[4] Y. Leblebici, F.K. Gürkaynak, D. Mlynek, "A compact
31-input programmable majority gate based on capacitive
threshold logic," in Proc. IEEE Int. ASIC Conference 1998,
pp. 281-285.
[5] B.K. Kar, D.K. Pradhan, "A new algorithm for order
statistic and sorting," IEEE Trans. on Signal Processing,
vol. 41, pp. 2688-2694, August 1993.
[6] C.C. Lin, C.J. Kuo, "Fast response 2-D rank order
algorithm by using max-min sorting network," International
Conference on Image Processing 1996, vol. 1, pp. 403-406.
[7] C. Chen, L. Chen, T. Chiueh, J. Hsiao, "An efficient
pipelined VLSI implementation of rank order filter,"
ISSIPNN 1994, vol. 2, pp. 630-633.
[8] C.L. Lee and C.W. Jen, "Bit-sliced median filter design
based on majority gate," in Proc. Inst. Elec. Eng.-G,
vol. 139, pp. 63-71, 1992.
[9] İ. Hatırnaz, F.K. Gürkaynak, Y. Leblebici, "A modular and
scalable architecture for the realization of high-speed
programmable rank-order filters," ASIC/SOC'99 Proceedings,
pp. 382-386, 1999.
