A Minimum Area VLSI Architecture for O(logn) Time Sorting by Bilardi, G. & Preparata, F.P.
REPORT ACT-45 NOVEMBER 1983
COORDINATED SCIENCE LABORATORY
APPLIED COMPUTATION THEORY GROUP
A MINIMUM AREA VLSI 
ARCHITECTURE FOR O(LOGN) 
TIME SORTING
G. BILARDI 
F. P. PREPARATA
APPROVED FOR PUBLIC RELEASE. DISTRIBUTION UNLIMITED.
REPORT R-1006 UILU-ENG 83-2227
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
REPORT DOCUMENTATION PAGE
Security classification of this page: Unclassified
1a. Report security classification: Unclassified
1b. Restrictive markings: None
2a. Security classification authority: N/A
2b. Declassification/downgrading schedule: N/A
3. Distribution/availability of report: Approved for public release, distribution unlimited.
4. Performing organization report number(s): R-1006; UILU-ENG 83-2227; ACT-45
5. Monitoring organization report number(s): N/A
6a. Name of performing organization: Coordinated Science Laboratory, Univ. of Illinois
6b. Office symbol (if applicable): N/A
6c. Address (city, state and ZIP code): 1101 W. Springfield Avenue, Urbana, IL 61801
7a. Name of monitoring organization: Office of Naval Research
7b. Address (city, state and ZIP code): 800 N. Quincy Street, Arlington, VA
8a. Name of funding/sponsoring organization: Joint Services Electronics Program
8b. Office symbol (if applicable): N/A
8c. Address (city, state and ZIP code): 800 N. Quincy Street, Arlington, VA
9. Procurement instrument identification number: Contract N00014-79-C-0424
10. Source of funding nos. (program element, project, task, work unit): N/A, N/A, N/A, N/A
11. Title (include security classification): A Minimum Area VLSI Architecture for O(log n) Time Sorting
12. Personal author(s): Bilardi, G. and Preparata, F. P.
13a. Type of report: Technical
13b. Time covered: (not filled in)
14. Date of report (yr., mo., day): November 1983
15. Page count: 26
16. Supplementary notation: N/A
17. COSATI codes (field, group, sub. gr.): (not filled in)
18. Subject terms: VLSI complexity, area-time trade-off, combination sorting,
bitonic merging, cube-connected-cycles, mesh, orthogonal
trees, optimal algorithms, parallel computation
19. Abstract:
A generalization of a known class of parallel sorting algorithms is presented,
together with a new architecture to execute them. A VLSI implementation is also proposed,
and its area-time performance is discussed. It is shown that an algorithm in the class is
executable in $O(\log n)$ time by a chip occupying $O(n^2)$ area. The design is a typical
instance of a "hybrid architecture", resulting from the combination of well-known VLSI
arrays such as the orthogonal trees and the cube-connected cycles; it is also the first known
to meet the $AT^2 = \Omega(n^2 \log^2 n)$ lower bound for sorters of n words of length
$(1+\epsilon)\log n$ ($\epsilon > 0$) while working in minimum $O(\log n)$ time.
20. Distribution/availability of abstract: Unclassified/unlimited
21. Abstract security classification: Unclassified
22a. Name of responsible individual: (not filled in)
22b. Telephone number (include area code): (not filled in)
22c. Office symbol: NONE
DD FORM 1473, 83 APR
A MINIMUM AREA VLSI ARCHITECTURE FOR O(LOGN) TIME SORTING
G. Bilardi and F. P. Preparata
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign 
1101 West Springfield Avenue 
Urbana, IL 61801
Abstract: A generalization of a known class of parallel sorting algorithms
is presented, together with a new architecture to execute them. A VLSI
implementation is also proposed, and its area-time performance is discussed.
It is shown that an algorithm in the class is executable in $O(\log n)$ time by
a chip occupying $O(n^2)$ area. The design is a typical instance of a "hybrid
architecture", resulting from the combination of well-known VLSI arrays such as
the orthogonal trees and the cube-connected cycles; it is also the first
known to meet the $AT^2 = \Omega(n^2 \log^2 n)$ lower bound for sorters of n words of
length $(1+\epsilon)\log n$ ($\epsilon > 0$) while working in minimum $O(\log n)$ time.
This work was supported in part by the Joint Services Electronics 
Program under Contract N00014-79-C-0424 and by an IBM predoctoral Fellowship.
1. Introduction
Sorting is one of the most widely studied problems from the computational 
point of view, and many algorithms have been proposed for its solution. Since the 
possibility of parallel computation has been considered, several parallel schemes 
have also been proposed for sorting. Different models for parallel computers 
are possible, and several have been considered in the literature during the 
past years. Recently, the advent of Very Large Scale Integrated (VLSI)
circuits has motivated the definition of a new model of computation that aims
at capturing the essential features of the new technology. Obviously, sorting
has been one of the first problems studied in the VLSI environment, and 
several results are already available. In particular Thompson [1] gives a 
survey of thirteen algorithms for sorting and discusses their performance in 
terms of the chip area A and of the time T that elapses between beginning and 
completion of a computation. Indeed, area and time are natural measures of 
complexity for VLSI circuits, reflecting production cost and incremental cost 
respectively.
A theoretical argument, due to Thompson [2], shows that any sorter of n
terms, with wordlength $q = (1+\epsilon)\log n$, with $\epsilon > 0$, must satisfy the relationship
$AT^2 = \Omega(n^2 \log^2 n)$. The argument is based on the facts that any chip that sorts
must support a flow of $\varphi = \Omega(n \log n)$ bits through a suitable bisection, and
that $AT^2 = \Omega(\varphi^2)$. This lower bound holds in a suitable VLSI model of computation
whose basic assumptions are that the chip is synchronous (transmission time is
independent of wire length) and semellective-unilocal (input data are read
only once, at prespecified input ports). A word-local restriction is also assumed
for the input format (all the bits of the same word enter the circuit at the
same point).
In a previous paper [3] we have shown that optimal VLSI sorters can indeed
be constructed for all computation times $T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$. These
sorters are based on a new architecture, the Pleated-Cube-Connected-Cycles
(PCCC), and execute bitonic sorting [4].
In this paper we concentrate on "very fast" sorting, i.e., the class of
VLSI sorting algorithms whose running time is $T = O(\log n)$. So far only one
VLSI design is known to achieve $\Theta(\log n)$ computation time: it is based on the
orthogonal trees architecture [5], [6] and implements an algorithm due to
Muller and Preparata [7].(1) The optimal layout of the orthogonal trees has area
$A = O(n^2 \log^2 n)$ [6], while the lower bound yields $A = \Omega(n^2)$ for
$T = O(\log n)$. On the other hand, a closer analysis of the algorithm shows that
the information flow $\varphi$ is $O(n \log n)$, so that the gap between upper and lower
bounds is not due to a gap between actual flow and a flow-based lower bound,
but it is due to the fact that the length of the layout bisection of the
orthogonal trees is $O(\log n)$ times as large as the graph bisection.
We will show in this paper that not only the lower bound on the flow,
but also the one on the $AT^2$ measure is tight, by exhibiting a new
architecture capable of sorting in area $A = O(n^2)$ and time $T = O(\log n)$.
The rather complex network is a typical instance of the "hybrid
architecture", resulting from the careful interplay of more standard VLSI
networks, such as the cube-connected-cycles machine, the mesh-connected machine,
and the binary-tree machine. The implemented algorithm is of the type first
introduced by Preparata [8], although the recursion strategy has been modified
to optimize the network area.
A slight modification of one of the building blocks of our sorter turns 
out to be an interesting network in its own right. It is called the mesh of CCC's, 
and is a powerful emulator of the binary cube, matching the performance of both 
the CCC and the PCCC machines.
A suitable combination of one $O(\log n)$ sorter and one mesh of CCC's
of proper size will allow us to construct an $AT^2$-optimal sorter for any
computation time $T \in [\Omega(\log n), O(\log^3 n)]$.
Thus we are able to conclude that optimal $AT^2 = \Theta(n^2 \log^2 n)$ sorting is
achievable in the entire "meaningful" range of computation times
$T \in [\Omega(\log n), O(\sqrt{n \log n})]$. (Simple fan-in arguments show that $\Omega(\log n)$ is a
lower bound for the computation time, and $A = \Omega(n \log n)$ is an immediate
consequence of the semellective assumption, so that computation times slower
than $O(\sqrt{n \log n})$ cannot result in smaller area.)
(1) Subsequent to the research leading to this paper, we learned of the
construction of Ajtai, Komlós, and Szemerédi [13], which also achieves
$\Theta(\log n)$ time; in addition we have devised a VLSI implementation of their
[remainder of footnote illegible].
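To make the "meaningful" range concrete, the following sketch evaluates the area implied by $AT^2 = n^2 \log^2 n$ at the two ends of the range. It is only an illustration: all hidden constants are set to 1, logarithms are taken base 2, and the function name `optimal_area` is hypothetical rather than part of the paper.

```python
import math

def optimal_area(n: int, T: float) -> float:
    """Area implied by A * T^2 = (n * log2 n)^2, with all constants set to 1."""
    return (n * math.log2(n)) ** 2 / T ** 2

n = 2 ** 16
log_n = math.log2(n)              # 16.0
T_fast = log_n                    # fastest end of the range: T ~ log n
T_slow = math.sqrt(n * log_n)     # slowest end of the range: T ~ sqrt(n log n)

area_fast = optimal_area(n, T_fast)   # equals n^2
area_slow = optimal_area(n, T_slow)   # equals n log n
print(area_fast, area_slow)
```

At the fast end the area is $n^2$, and at the slow end it drops to $n \log n$, which is the area floor quoted above for the semellective model.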
In Section 2 we introduce a general framework for sorting algorithms,
called COMBINE-SORT, which is based on an operation, COMBINATION,
generalizing the operation of MERGING from two to several sequences. Sections 3
and 4 describe an architecture (COMBINER) and an algorithm for COMBINATION,
respectively. The combiner network so obtained is then used in Section 5 as a
building block for a general class of COMBINATION-sorters. One of these sorters
is shown to have optimal area $A = \Theta(n^2)$, for $T = \Theta(\log n)$ computation time.
Finally, in Section 6, we discuss the area-time trade-off for sorting, and
show that optimal sorters can be constructed for any computation time within the
above range.
2. A Class of Parallel Sorting Algorithms
Several sorting algorithms can be viewed as particular cases of a rather 
general scheme, which we now describe.
We call COMBINATION the operation that produces from m sorted sequences of 
t elements each one sorted sequence of mt elements. A network implementing 
this operation is called an (m,t)-COMBINER. When m = 2, COMBINATION reduces 
to merging.
A parallel algorithm for the (m,t)-COMBINER has been introduced in [8],
and is based on the following idea. The m input sequences $S_0,\ldots,S_{m-1}$ are
pairwise merged to compute for each $i,j \in \{0,1,\ldots,m-1\}$, and each $\ell \in \{0,1,\ldots,t-1\}$,
the number $C_{ij}(\ell)$ of elements of sequence $S_j$ that are less than the $\ell$-th element
of sequence $S_i$. $C_{ij}(\ell)$ is readily obtained as the difference of the ranks of
this element in the merge of $S_i$ and $S_j$ and in $S_i$. By summing the $C_{ij}(\ell)$'s
over j we then obtain the rank of the $\ell$-th element of $S_i$ in the output sequence
of the COMBINER; thus, to complete the operation, we simply need to store each
element in the position specified by its rank. The primitive operation of the
scheme, the merging of two sequences, can be done, for example, by Batcher's
bitonic merger [4].
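The rank-based idea above can be sketched sequentially as follows. This is a minimal illustration, not the parallel network: the function name `combine` is hypothetical, binary search stands in for the pairwise merges, and ties are broken by the rule stated later in Section 3.1 (strictly less when i < j, less than or equal when i > j), with the position $\ell$ itself used as the contribution of $S_i$.

```python
from bisect import bisect_left, bisect_right

def combine(seqs):
    """Combine m sorted sequences into one sorted list by computing ranks."""
    out = [None] * sum(len(s) for s in seqs)
    for i, s_i in enumerate(seqs):
        for l, x in enumerate(s_i):
            rank = 0
            for j, s_j in enumerate(seqs):
                if j == i:
                    rank += l                      # elements of S_i preceding x
                elif i < j:
                    rank += bisect_left(s_j, x)    # elements of S_j strictly less than x
                else:
                    rank += bisect_right(s_j, x)   # elements of S_j less than or equal to x
            out[rank] = x                          # store x at the position given by its rank
    return out

result = combine([[1, 4, 9], [2, 4, 8], [0, 4, 7]])
print(result)  # [0, 1, 2, 4, 4, 4, 7, 8, 9]
```

Because ties are resolved by sequence index, the ranks form a permutation of 0..mt-1 even when equal keys occur in different sequences.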
Given $n = m_1 m_2 \cdots m_d$ elements, we can sort them in d stages according
to the following scheme that we call COMBINE-SORT.
At stage 1 we perform $n/m_1$ combination operations, each on $m_1$ sequences of
1 element each. At stage 2 we perform $n/(m_1 m_2)$ combinations, each on $m_2$ sequences
of $m_1$ elements each, and at stage i we perform $n/(m_1 \cdots m_i)$ combinations, each on
$m_i$ sequences of length $m_1 \cdots m_{i-1}$. Finally, at stage d we combine $m_d$ sequences
of length $n/m_d$ into one sequence of length n, which is the output of the
COMBINE-SORT scheme.
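The staging of COMBINE-SORT can be illustrated with a short sequential sketch. The function name `combine_sort` and the argument `factors` (the ordered list $m_1,\ldots,m_d$) are hypothetical, and `heapq.merge` is used only as a stand-in combiner; it is not the rank-based COMBINER of [8].

```python
from heapq import merge

def combine_sort(items, factors):
    """Sort n = m_1 * ... * m_d items in d stages of COMBINATION."""
    expected = 1
    for m in factors:
        expected *= m
    if expected != len(items):
        raise ValueError("the product of the factors must equal the number of items")
    seqs = [[x] for x in items]          # n sorted sequences of length t_0 = 1
    for m in factors:                    # stage i uses factor m_i
        # combine consecutive groups of m_i sorted sequences into one
        seqs = [list(merge(*seqs[g:g + m])) for g in range(0, len(seqs), m)]
    return seqs[0]

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0, 11, 10]
print(combine_sort(data, [2, 3, 2]))     # stages with 6, 2 and 1 combinations
```

With all factors equal to 2 the loop performs an ordinary bottom-up merge sort, and with the single factor n it performs one n-way combination.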
A diagrammatic illustration of the scheme is given in Figure 1 in the form
of a rooted tree. Each node of this tree is a suitable combiner. An
$(m_i, t_{i-1})$-COMBINER, $1 \le i \le d$, performs the combination of $m_i$ (sorted) sequences
of length $t_{i-1}$; here $t_0 = 1$ and $t_{i-1} = m_1 m_2 \cdots m_{i-1}$ for i > 1. Note that each
level of the tree corresponds to a stage of the combination scheme, and that
there are $n_i = n/t_i$ nodes at level i, $1 \le i \le d$.
[Figure 1: tree of COMBINERS; the lowest level consists of the $n_1 = n/m_1$ $(m_1,1)$-COMBINERS fed by the $n_0 = n$ inputs.]
Figure 1. Diagram of COMBINE-SORT scheme.
Several known sorting algorithms can be cast in the COMBINE-SORT scheme.
Each algorithm is characterized by a particular factorization $n = m_1 \cdots m_d$
(note that the order of the factors is relevant here), and by the specification
of how the combination is to be performed. In particular if we use the
COMBINER based on [8] we have the following cases.
(i) When $n = 2^d$, and $m_1 = m_2 = \cdots = m_d = 2$, then COMBINE-SORT reduces to
the usual MERGE-SORT.
(ii) When d = 1, and $m_1 = n$, the COMBINE-SORT reduces to only one
(n,1)-COMBINER, which is essentially the sorting network described
in [7].
(iii) When $d = \log\log n / \log(1/(1-\alpha))$, and $m_i = n^{\alpha(1-\alpha)^{d-i}}$ with $0 < \alpha < 1$,
we obtain the sorting schemes described in [8]. The sorting scheme
corresponding to a given $\alpha$ can be described as follows. The n-input sequence
is split into $n^{\alpha}$ ($m_d$ in our terminology) sequences of $n^{1-\alpha}$ ($t_{d-1}$ in our
terminology) elements each. These sequences are sorted recursively, and then
combined by an $(m_d, t_{d-1})$-COMBINER. The recursion stops when sequences of
length 1 are obtained. We can obtain the values for d and $m_1,\ldots,m_d$ by a simple
analysis of the unfolded recursive process.
In the following sections, we shall explore which other choices of d
and $m_1, m_2, \ldots, m_d$ can be made to minimize the complexity of a VLSI
implementation of COMBINE-SORT.
3. An (m,t)-COMBINER Network
In this section we propose a parallel architecture for an (m,t)-COMBINER,
where $m = 2^{\mu}$ and $t = 2^{\tau}$ are powers of two. This architecture will
accept as input m sorted sequences of t elements each,
$S_i = (s_i(0), s_i(1), \ldots, s_i(t-1))$, $i = 0,1,\ldots,m-1$,
and produce as output a single sorted sequence S, which is the combination of
$S_0,\ldots,S_{m-1}$ and has N = mt elements,
$S = (s(0), s(1), \ldots, s(N-1))$.
The (m,t)-COMBINER will execute the algorithm based on pairwise merging
as outlined in the preceding section. Its organization is illustrated in
Figure 2. It consists of $m^2$ modules (each capable of merging two sequences of
length t and of computing partial ranks), laid out as a square m x m mesh and
indexed as $M_{ij}$ (i,j = 0,1,...,m-1). The modules of each row are interconnected as
the leaves of a binary tree of bandwidth t; so are the modules of each column.
Thus, the combiner has the structure of the orthogonal-trees machine [5,6], whose
leaves are merging modules. The interconnecting trees have the following
functions:
(i) to "broadcast" a sequence to all units in which it must be merged with
some other sequence;
(ii) to compute global ranks from partial ranks;
(iii) to rearrange the elements according to their ranks into the sorted
sequence S.
We will now describe in some detail the merging modules and the
interconnecting trees.
[Figure 2 labels: RT-lines, CT-lines.]
Figure 2. Overview of (m,t)-COMBINER, for m = 4.
3.1. Merging Modules
Merging module $M_{ij}$ will merge sequences $S_i$ and $S_j$ and compute $C_{ij}(\ell)$,
for $\ell = 0,\ldots,t-1$. We recall that $C_{ij}(\ell)$ is the number of elements of $S_j$
that are less than (respectively, less than or equal to) $s_i(\ell)$ when $i \le j$ (when i > j).
Each module is realized (Figure 3) as a cube-connected-cycles (CCC)
interconnection of smaller processing elements, called micromodules (each
micromodule has a bandwidth of 1 bit). Specifically, the merging module is
a $(\tau+1, 2^{\tau+1})$-CCC (i.e., it has $2^{\tau+1}$ cycles each of length $\tau+1$). We number the
micromodules of $M_{ij}$ as $M_{ij}(h,k)$, with $0 \le h < \tau+1$ and $0 \le k < 2^{\tau+1}$, so that the
merging module may be thought of as a $(\tau+1) \times 2^{\tau+1}$ array (rows are numbered from
bottom to top, columns from left to right). The columns of this array are
connected as cycles with a link between $M_{ij}(h,k)$ and $M_{ij}((h+1) \bmod (\tau+1), k)$. The rows
$0,1,\ldots,\tau$ are associated with the dimensions $E_0,E_1,\ldots,E_\tau$ of a $(\tau+1)$-dimensional
binary cube [9], and there is a link between $M_{ij}(h,k_1)$ and $M_{ij}(h,k_2)$ if and only if
the binary expansions of $k_1$ and $k_2$ differ exactly in the coefficient of $2^h$.
The reader is referred to [10] for a detailed explanation of the CCC;
the reader should also be warned that in this paper we will not use the CCC at its full
capability, since we deploy a network with $2t \log 2t$ (rather than 2t)
micromodules to merge two sequences of length t. In other words, a binary cube
with $2^{\tau+1}$ nodes is emulated by a $(\tau+1, 2^{\tau+1})$-CCC.(2) When the $2^{\tau+1}$ items on which we operate have
to be processed on the cube dimension $E_h$, we just need to guarantee that the
items are in row h of the CCC. Thus, execution of the ASCEND and DESCEND
paradigms, in which the dimensions are used in the sequence $(E_0,E_1,\ldots,E_\tau)$
and $(E_\tau,E_{\tau-1},\ldots,E_0)$ respectively, is quite straightforward.
The layout of a $(3,2^3)$-CCC in Figure 3 shows two sets of 4 input lines
(denoted, respectively, as RT- and CT-lines) each carrying one of the two
4-element sequences to be merged.
(2) For this reason the number of micromodules in a cycle is not constrained
to be a power of 2.
Figure 3. Merging unit $M_{ij}$ realized by a $(3,2^3)$-CCC, used to merge two
sequences with four elements each.
Recalling from [10] that a CCC with N processing elements of constant area
can be laid out in area $O(N^2/\log^2 N)$, we conclude that a merging module can
be laid out in area $O(A_0 t^2)$, where $A_0$ is the area of the micromodule. In
the next section, in connection with the COMBINATION algorithm, we shall
specify the functional capabilities of each micromodule, from which it will
be clear that $A_0$ is constant, i.e., independent of the problem size.
3.2. Interconnecting Trees
As indicated earlier, the merging modules are interconnected by two
families of N = mt complete binary trees with $m = 2^{\mu}$ leaves and bandwidth 1.
We will refer to these families as the row trees and column trees.
The lines of the row trees and the column trees are respectively labelled
$RT_i(\ell)$ and $CT_j(\ell)$, $i,j = 0,\ldots,m-1$; $\ell = 0,\ldots,t-1$. The trees and the merging
modules are connected through a small interface, whose structure will be fully
specified in connection with the description of the COMBINATION algorithm in
the next section. At this point we just say that the leaves of $RT_i(\ell)$ are,
from left to right, connected to the CCC micromodules
$M_{i0}(0,\ell), M_{i1}(0,\ell), \ldots, M_{i,m-1}(0,\ell)$; the leaves of $CT_j(\ell)$ are connected to the
CCC micromodules $M_{0j}(0,t-1+\ell), M_{1j}(0,t-1+\ell), \ldots, M_{m-1,j}(0,t-1+\ell)$; in other
words, the row trees and the column trees are respectively connected to the RT
and the CT lines of the merging modules. The connection between each leaf of
a tree and the corresponding CCC micromodule is realized through a buffer
register of the appropriate size (adequate to store one element to be sorted).
The situation is illustrated in Figure 4.
[Figure 4 labels: $RT_i(0)$, ..., $RT_i(3)$.]
Figure 4. Interconnection of modules and trees.
4. The COMBINATION Algorithm
We now describe how the sorting algorithm of [8], based on pairwise
merging, can be executed on the architecture introduced in Section 3. This analysis
will elucidate the structure of the CCC micromodules. We recall that the inputs
are $m = 2^{\mu}$ sorted sequences of $t = 2^{\tau}$ elements each, with $N = 2^{\nu} = mt$.
For convenience we split the algorithm into several phases.
(A) Input of Data and Broadcasting to Merging Modules
Element $s_i(\ell)$ is input at the root of tree $RT_i(\ell)$, and is then broadcast
to all leaves of the tree. At this point, the left half of row(0) in
module $M_{ij}$ contains the sequence $S_i$. To fill the right halves of row(0) of
all modules, we proceed as follows. First, in each "diagonal" module $M_{ii}$ the
sequence $S_i$ is copied in the second half of row(0). (This can be done
by using the connection of row($\tau$) between the left and the right half
of the machine.) Next, from micromodule $M_{jj}(0,t-1+\ell)$, which is a leaf of
$CT_j(\ell)$, element $s_j(\ell)$ is broadcast (through the root) to all the other leaves of
the same tree. At this point, the merging module $M_{ij}$ contains $S_i$ and $S_j$ in
the 0-th row and merging can begin.
(B) Merging and Partial Rank Computation
Merging can be executed by resorting to the bitonic algorithm, which
complies with the DESCEND paradigm of the binary cube (see [10]).
However, in order to execute bitonic merging, we first need to reverse the
order of $S_j$. This is accomplished by an ASCEND algorithm in which columns
t to 2t-1 of each $M_{ij}$ exchange their data at dimensions $E_0,\ldots,E_{\tau-1}$, while
columns 0 to t-1 remain idle. All the columns are idle at dimension $E_\tau$.
Now the data are ready for bitonic merging. At each dimension,
$E_\tau,E_{\tau-1},\ldots,E_1,E_0$, pairs of elements are compared and exchanged, if necessary,
to place the smaller of the two in the column with the smaller number. Each
processor (micromodule) of the merging module is equipped with a serial
comparator that reads the inputs starting from the most significant bit. As
long as the two inputs agree, they are transmitted to the next processor in the
same column. As soon as a bit discrepancy is detected, a switch is set and, from
now on, the remaining substrings of the operands will follow a fixed path,
independently of their value. It is then easy to see how the computation
through the rows of the CCC can be naturally pipelined to achieve a computation
time of $O(\tau+q)$, where q is the length of the input words. At the end of merging,
the result resides in row(0) of the CCC, and the element in $M_{ij}(0,\ell)$,
$0 \le \ell \le 2t-1$, has rank $\ell$ in MERGE($S_i$,$S_j$). Now we want to transmit the ranks
of $s_i(0),\ldots,s_i(t-1)$ to processors $M_{ij}(0,0),\ldots,M_{ij}(0,t-1)$, respectively.
This is accomplished by retracing backwards the path traversed by each element
$s_i(\ell)$, and is easily done if each $M_{ij}(h,k)$ keeps track of whether or not it exchanged
the operands during the merging process. So, all we have to do is to
run the machine backwards, with an ASCEND algorithm, which applies to the ranks
the inverse of the permutation that merged the elements. At the end of
this phase, processor $M_{ij}(0,\ell)$, $0 \le \ell \le t-1$, stores the number of elements in
MERGE($S_i$,$S_j$) that are less than $s_i(\ell)$. If from this number we subtract $\ell$ we
obtain $C_{ij}(\ell)$, the number of elements of $S_j$ which are less than $s_i(\ell)$. We call
the $C_{ij}$'s partial ranks because from them we can compute the rank of each
$s_i(\ell)$ in the sorted sequence S as $C_i(\ell) = \sum_{j=0}^{m-1} C_{ij}(\ell)$.
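Phase B can be mimicked on an ordinary list that plays the role of row(0) of one merging module. The sketch below is an illustration under several assumptions: the function name `bitonic_merge_with_ranks` is hypothetical, $S_j$ is written in reversed order directly instead of being reversed by an ASCEND pass, ties are resolved in favour of $S_i$ (the i < j case), and the bit-serial pipelining is not modelled. The recorded exchanges are replayed backwards to carry each rank to the column where its element started.

```python
def bitonic_merge_with_ranks(s_i, s_j):
    """Merge two sorted lists of length t = 2**tau; return (merged, partial ranks of s_i)."""
    t = len(s_i)
    if len(s_j) != t or t == 0 or (t & (t - 1)) != 0:
        raise ValueError("both sequences need the same power-of-two length")
    tau = t.bit_length() - 1
    # Columns 0..t-1 hold S_i, columns t..2t-1 hold S_j reversed (a bitonic row).
    # Each entry is (key, source); source 0 = S_i, so equal keys favour S_i.
    row = [(x, 0) for x in s_i] + [(x, 1) for x in reversed(s_j)]
    exchanged = []                                   # switch settings, in order of use
    for h in range(tau, -1, -1):                     # DESCEND: E_tau, ..., E_0
        for k in range(2 * t):
            if (k & (1 << h)) == 0:
                p = k | (1 << h)                     # neighbour of k along dimension E_h
                if row[k] > row[p]:
                    row[k], row[p] = row[p], row[k]  # smaller element to smaller column
                    exchanged.append((k, p))
    ranks = list(range(2 * t))                       # column l now holds the rank-l element
    for k, p in reversed(exchanged):                 # run the machine backwards
        ranks[k], ranks[p] = ranks[p], ranks[k]
    partial = [ranks[l] - l for l in range(t)]       # C_ij(l) = rank in merge minus l
    return [x for x, _ in row], partial

merged, partial = bitonic_merge_with_ranks([1, 4, 6, 9], [2, 3, 7, 8])
print(merged)   # [1, 2, 3, 4, 6, 7, 8, 9]
print(partial)  # [0, 2, 2, 4]
```

In the example, 4 and 6 each have two elements of the second sequence (2 and 3) below them, which is what the partial ranks report.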
(C) Total Rank Computation
It is immediate to see that at the end of phase B the partial ranks
$C_{i0}(\ell), C_{i1}(\ell), \ldots, C_{i,m-1}(\ell)$ of $s_i(\ell)$ are available exactly at the leaves
of row tree $RT_i(\ell)$. By having in each internal node of the tree a full adder
with a 1-bit delay feedback on the carry, we can then obtain at the root
of $RT_i(\ell)$ the sum $C_i(\ell)$ of the values stored at the leaves. The nodes work as
serial adders and the tree is used in a pipelined fashion, so that the time
required is $O(\mu+\tau)$, where $\mu = \log m$ is the depth of the tree, and $\tau+1$ is the
wordlength of the operands (note that $C_{ij}(\ell) \le 2^{\tau}$). Within the same order of
time, we can subsequently broadcast $C_i(\ell)$ from the root to the leaves.
(Indeed $C_i(\ell) < 2^{\mu+\tau}$, so it can be expressed by $\tau+\mu$ bits.)
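The adder tree of phase C can be sketched as follows. The helper names (`to_bits`, `serial_add`, `tree_sum`) are hypothetical, the number of leaves is assumed to be a power of two, operands are fed least significant bit first, and each level is processed as a whole bit stream, so the pipelined timing of the real tree is not modelled.

```python
def to_bits(value, width):
    """Little-endian bit list of `value`, padded to `width` bits."""
    return [(value >> b) & 1 for b in range(width)]

def from_bits(bits):
    """Integer encoded by a little-endian bit list."""
    return sum(bit << b for b, bit in enumerate(bits))

def serial_add(a_bits, b_bits):
    """Bit-serial addition: one full adder whose carry is fed back with a 1-bit delay."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):      # least significant bit first
        total = a + b + carry
        out.append(total & 1)
        carry = total >> 1
    return out                            # a final carry is dropped; width must be sufficient

def tree_sum(partial_ranks, width):
    """Sum the leaf values through a complete binary tree of serial adders."""
    level = [to_bits(c, width) for c in partial_ranks]
    while len(level) > 1:
        level = [serial_add(level[k], level[k + 1]) for k in range(0, len(level), 2)]
    return from_bits(level[0])

# m = 4 leaves (mu = 2), t = 4 (tau = 2): tau + mu = 4 bits hold any total rank below 16.
print(tree_sum([0, 2, 4, 3], width=4))  # 9
```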
(D) Sorting Permutation and Output of Data
We want to output the elements s(0),...,s(N-1) of the sorted sequence from
the roots of the column trees, and, specifically, we want the root of $CT_j(\ell)$ to
output element $s(j2^{\tau}+\ell)$. This corresponds to a natural left-to-right order
of the column trees as they appear in the layout of Figure 2.
Considering a generic element $s_i(h)$ with rank $C_i(h)$, the binary spellings
of the integers j and $\ell$ such that $s_i(h)$ will emerge from the root of column tree
$CT_j(\ell)$ are readily obtained by taking the $\mu$ most significant bits and the $\tau$
least significant bits of the rank $C_i(h)$ to represent j and $\ell$, respectively.
Thus, as a first step, we "activate" in $M_{ij}$ the elements of sequence $S_i$ that
have to emerge from the trees $CT_j(\ell)$, and "inhibit" all other elements. The
active elements are those whose rank $C_i(h)$ has the $\mu$ most significant bits
agreeing with the column number j of the merging module. Next, we rearrange
the active elements in $M_{ij}$ so that $s_i(h)$ is sent to $M_{ij}(0,\ell)$, with
$\ell = C_i(h) \bmod t$.
This operation is essentially a permutation of the active (and non-active)
elements, and can be done by using the CCC as an emulator of the Benes network.
The setting of the switches, although nontrivial, is greatly simplified with
respect to the general case by the fact that the active elements do not change
their relative order. The desired rearrangement can be done by using the
idea of concentration introduced in [11], and expansion, which could be
viewed as the inverse of concentration. If k elements are active in a
given module, they are first sent to the k leftmost columns of the CCC
(concentration), and then routed to the destination columns (expansion).
A straightforward adaptation of the algorithm that is proposed in [11] for concentration
in the cube-machine shows that an ASCEND and a DESCEND phase are all that is
required to rearrange data on our CCC. Some bits required to set the switches must be
precomputed. This task could be performed by the CCC or, to keep the
micromodule structure as simple as possible, it can be assigned to a
binary tree of full adders whose leaves would be contained in the interface
between the CCC and the row-trees.
During the entire rearrangement task, computation takes place only in
the left half of the CCC without using dimension $E_\tau$. We then transfer each
active element from $M_{ij}(0,\ell)$ to $M_{ij}(0,t-1+\ell)$, with a straightforward use of
dimension $E_\tau$.
At this point element $s(j2^{\tau}+\ell)$ is in $M_{ij}(0,t-1+\ell)$ (where the value of i is
determined by the input sequence to which $s(j2^{\tau}+\ell)$ originally belongs), and
is ready to be transmitted to the root of $CT_j(\ell)$ where it is output.
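The addressing used in phase D amounts to splitting the total rank into its high and low bits. A small sketch, with hypothetical function names and the example sizes chosen only for illustration:

```python
def output_port(rank, tau):
    """Split a total rank into (j, l): the column tree CT_j(l) that outputs the element."""
    t = 1 << tau
    return rank >> tau, rank & (t - 1)   # high bits give j, low tau bits give l = rank mod t

def is_active(rank, j, tau):
    """An element is active in module M_ij when the high bits of its rank equal j."""
    return (rank >> tau) == j

# Example with m = 4 (mu = 2) and t = 4 (tau = 2), so ranks run from 0 to 15.
j, l = output_port(13, tau=2)
print(j, l)                       # 3 1, since 13 = 3 * 4 + 1
print(is_active(13, 3, tau=2))    # True
```

Each rank r therefore leaves through exactly one column tree, and $r = j2^{\tau} + \ell$ recovers the natural left-to-right output order.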
4.1. Performance Analysis and Modification of the Network
The entire machine, even when not explicitly said, is intended to work in
bit-serial mode. Both the CCC's and the trees work in a pipelined fashion.
Thus any operation takes essentially time proportional to the sum of the
operand length and the pipe depth. For the CCC's the depth is $\tau+1$. The operands
to be handled have length q when they are input words or $\tau+1$ when they are
partial ranks. Since a constant number of ASCEND and DESCEND algorithms are
executed, we conclude that $O(\tau+q)$ total time is spent in the CCC's. For the
trees the depth is $\mu+1$. The operands to be handled have length q when they are
input words, or $\tau+\mu$ when they are total ranks. Since a constant number of
fan-in and fan-out algorithms are executed, we conclude that $O(\tau+\mu+q)$ total
time is spent in the trees. Thus the time spent in the interconnecting trees
dominates that spent in the CCC's, and we reach the conclusion that the
$(2^{\mu},2^{\tau})$-COMBINER of elements of q bits works in time $T = O(\tau+\mu+q)$.
So far, all the parameters $\tau$, $\mu$, and q have been regarded as independent of
each other. We now make an interesting observation. When $q = \Omega(2^{\mu})$, then
$T = O(\tau+q)$. In this case the time performance of the trees is not substantially
degraded if we realize them as comb-trees, rather than as complete binary trees.
The depth will go from $\mu$ to $2^{\mu}$, but this is tolerable in time since $2^{\mu} = O(q)$.
On the other hand comb-trees can be laid out in constant rather than logarithmic
width, thus yielding a saving in area. The modified $(2^{\mu},2^{\tau})$-COMBINER of
words of length $q = \Omega(2^{\mu})$ has then $T = O(\tau+q) = O(\log N + q)$ and $A = O(2^{2(\tau+\mu)}) = O(N^2)$.
4.2. Summary of Symbols and Results for an (m,t)-COMBINER
Sizes: $m = 2^{\mu}$, $t = 2^{\tau}$, N = mt, q = wordlength
Input sequences:
$S_i = (s_i(0), s_i(1), \ldots, s_i(t-1))$, $i = 0,1,\ldots,m-1$.
Output sequence:
$S = (s(0), s(1), \ldots, s(N-1))$.
Merging modules: $(\tau+1, 2^{\tau+1})$-CCC's
$M_{ij}$: i, j = 0,1,...,m-1
$M_{ij}(h,k)$: $0 \le h < \tau+1$, $0 \le k < 2t$, micromodules of $M_{ij}$.
Row-trees and column-trees:
$RT_i(\ell)$, $CT_j(\ell)$: $0 \le i,j \le m-1$, $0 \le \ell \le t-1$.
Machine Performance
                                      A                 T
Full tree version                     $O(N^2 \mu^2)$    $O(\tau+\mu+q)$
Comb-tree version, $q = \Omega(m)$    $O(N^2)$          $O(\tau+q)$
5. An Architecture for COMBINATION-SORT
We shall now use the COMBINER developed in the two preceding sections to
construct a general network for COMBINATION-SORT. As an intermediate step in
the construction, we introduce a new operation called COALESCENCE. Given a
collection of n elements, partitioned into $n/t_{i-1}$ sorted subsequences each
containing $t_{i-1}$ elements, and given a multiple $t_i$ of $t_{i-1}$ which is also a
divisor of n, we call $(n; t_{i-1} : t_i)$-COALESCENCE the operation of combining (in the
sense defined earlier) consecutive blocks of $m_i = t_i/t_{i-1}$ subsequences.
If we refer to the tree of Figure 1, we can easily see that each level 
of the tree corresponds to a coalescence of the input sequence. If we call 
COALESCER a network that performs a coalescence, we can build a COMBINATION- 
SORTER by cascading a suitable set of coalescers, as shown in Figure 5.
Figure 5. COMBINATION-SORTER as a cascade of COALESCERS.
5.1. The COALESCER
An $(n; t_{i-1} : t_i)$-COALESCER can be easily constructed by using $n_i = n/t_i$
$(m_i, t_{i-1})$-COMBINERS. Let us assume, for simplicity, that $n_i$ is a perfect
square. We can then lay out the combiners in a $\sqrt{n_i} \times \sqrt{n_i}$ array with input and
output lines running in a chosen direction, say, parallel to the rows. An
example with $n_i = 4$ is shown in Figure 6.
[Figure 6 labels: INPUT lines, OUTPUT lines.]
Figure 6. Layout of an $(n; t_{i-1} : t_i)$-COALESCER with $n_i = n/t_i$
$(m_i = t_i/t_{i-1}, t_{i-1})$-COMBINERS.
We now estimate the area of the COALESCER. We first assume the use of
full-tree COMBINERS, so that the side of the COMBINER has a
length of $O(t_i \log m_i)$. For the layout shown in Figure 6, we then have:
height = $O(\sqrt{n_i}\, t_i \log m_i + n_i t_i) = O(n(1 + \log m_i/\sqrt{n_i}))$
width = $O(\sqrt{n_i}\, t_i \log m_i) = O(n \log m_i/\sqrt{n_i})$
If instead we use comb-tree COMBINERS, whose side has length $O(t_i)$, the size becomes
height = $O(n)$
width = $O(n/\sqrt{n_i})$
The computation time is $T_F = O(\tau+q+\log m_i)$ for the full-tree COALESCER, and
$T_C = O(\tau+q+m_i)$ for the comb-tree COALESCER. When $q = \Theta(\log n)$, then $T_F = O(\log n)$.
If, in addition, $m_i = O(\log n)$, then $T_C = O(\log n)$.
5.2. An Optimal VLSI Sorter
From the previous considerations it is easy to see that we can obtain
a VLSI implementation of COMBINATION-SORTERS by suitable use of COALESCERS.
It should also be easy to compute time and area, once the factorization
$n = m_1 m_2 \cdots m_d$ for the algorithm is chosen.
We now show that there is a COMBINATION-SORTER for words of length
$q = \Theta(\log n)$ that sorts n elements in time $T = O(\log n)$ and area $A = O(n^2)$,
thus achieving the known lower bound for this problem. The sorter we propose
is given by the block diagram in Figure 7.
Figure 7. A COMBINATION-SORTER with three COALESCERS, for optimal VLSI sorting.
From the general analysis we easily see that the coalescers take area
(width x height) $O(n) \times O(n)$, $O(n \log\log n/\sqrt{\log n}) \times O(n)$, and $O(n) \times O(n)$
respectively. It is also clear that the total time is $O(\log n)$, thus our claim
is proved.
6. Area-Time Trade-Off
The COMBINATION-sorter proposed in this paper has optimal area among sorters
that achieve minimum computation time. It is now interesting to ask whether
we can trade time for area, and build a slower but smaller sorter with optimal
$AT^2 = \Theta(n^2 \log^2 n)$.
Since, as we already recalled in the Introduction, area-time optimal
circuits for sorting can be built when $T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$ (for a
$(1+\epsilon)\log n$ wordlength), the range of computation times for which no optimal circuit
is known yet is $[\Omega(\log n), O(\log^3 n)]$.
We will now describe a network which, by choosing an appropriate value for
a design parameter, allows us to sort in area $A = O(n^2 \log^2 n/T^2)$ for any time
$T \in [\Omega(\log n), O(\sqrt{n \log n})]$. The network is the cascade interconnection of two
components. The first component is a COMBINATION-sorter for $n/s$ inputs. The second
component is a new general architecture, called the mesh-of-CCC (MCCC) and
obtained by suitably "hybridizing" known networks (the mesh and the CCC).
This architecture will now be described in detail.
An $(n,s)$-MCCC, with $n = 2^\nu$, $s = 2^\sigma$, and $r = n/s^2 = 2^\rho$ ($\rho = \nu - 2\sigma$), consists of
$s^2$ CCC modules, each with $r$ cycles of length $\rho$. The $n \times \rho$
processing elements of the MCCC are conveniently indexed as
$M_{ij}(h,k)$: $0 \le i, j < s$, $0 \le h < \rho$, $0 \le k < r$.
For a fixed $(i,j)$ pair the set $\{M_{ij}(h,k) : 0 \le h < \rho,\ 0 \le k < r\}$ is connected as
a CCC-module, exactly as described in Section 3. The CCC modules are then arranged
as an $s \times s$ mesh, and, for a fixed $k$, the set of micromodules $\{M_{ij}(0,k) : 0 \le i, j < s\}$
is mesh-connected (with $i$ and $j$ as row and column indices, respectively).
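The indexing scheme can be made concrete with a short sketch. The function names `mccc_elements` and `mesh_neighbors` are illustrative and do not come from the report, and the mesh is assumed to have no wraparound edges, which the text does not specify.

```python
from itertools import product


def mccc_elements(nu, sigma):
    """List the indices (i, j, h, k) of every processing element M_ij(h, k)."""
    rho = nu - 2 * sigma              # r = n / s^2 = 2^rho
    if rho < 1:
        raise ValueError("need nu > 2 * sigma so each CCC has at least one cycle")
    s, r = 2 ** sigma, 2 ** rho
    # s x s modules, each a CCC with r cycles (index k) of length rho (index h)
    return list(product(range(s), range(s), range(rho), range(r)))


def mesh_neighbors(i, j, k, s):
    """Mesh links of M_ij(0, k): position 0 of cycle k in the adjacent modules.

    Assumption: a plain mesh without wraparound edges.
    """
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b, 0, k) for a, b in candidates if 0 <= a < s and 0 <= b < s]


elements = mccc_elements(nu=6, sigma=1)   # n = 64, s = 2, r = 16, rho = 4
print(len(elements))                      # n * rho elements in total
print(mesh_neighbors(0, 0, 5, s=2))
```

Only the elements at position h = 0 of each cycle carry mesh links, so each module exposes r = n/s^2 mesh ports per neighbor.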
The MCCC closely resembles the COMBINER architecture defined in Section 4, and
more specifically, the version with comb-trees. In fact the MCCC could be
obtained from the comb-tree connected CCC's by identifying in all CCC's
micromodules $M_{ij}(h,k)$ and $M_{ij}(h,k+t)$ (with $0 \le k < t$), and deleting the edges
related to E .
The mesh of CCC's is a very interesting network in its own right, and we
shall now show how it can: (i) emulate the ASCEND (or DESCEND) paradigm [10]
of the Binary-Cube in optimal $AT^2 = O(n^2 \log^2 n)$ for any computation time
$T \in [\Omega(\log^2 n), O(\sqrt{n \log n})]$; (ii) emulate the SORTING paradigm [3] in optimal
$AT^2 = O(n^2 \log^2 n)$ for any computation time $T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$. (Recall
that we are referring to $\Theta(\log n)$-bit input words, and to a word-serial mode of
operation.)
If we consider a $\nu$-dimensional binary cube whose processors are
$P_0, P_1, \ldots, P_{n-1}$ ($n = 2^\nu$), we can establish the following correspondence between
MCCC micromodules and cube processors:
$M_{ij}(0,k) \leftrightarrow P_t$, $t = 2^{\rho+\sigma} i + 2^{\rho} j + k$.
Then it is easy to see that dimensions $E_0, \ldots, E_{\rho-1}$ of the cube are assigned to
the CCC modules, dimensions $E_\rho, \ldots, E_{\rho+\sigma-1}$ are assigned to the mesh columns,
and finally dimensions $E_{\rho+\sigma}, \ldots, E_{\rho+2\sigma-1}$ are assigned to the mesh rows. Thus,
by application of well known techniques for emulating the cube with a CCC or a
linear array [10], an ASCEND (or DESCEND) algorithm can be executed in $O(\rho + s)$
word-steps.
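The correspondence can be sketched in a few lines, writing rho for log r and sigma for log s. The sketch assumes the reading in which the row index i occupies the sigma topmost bits of t, the column index j the next sigma bits, and k the rho lowest bits; the helper names are hypothetical.

```python
def cube_index(i, j, k, rho, sigma):
    """Index t of the cube processor P_t matched with micromodule M_ij(0, k).

    Assumed layout of t, from high bits to low: i (sigma bits),
    j (sigma bits), k (rho bits).
    """
    s, r = 1 << sigma, 1 << rho
    assert 0 <= i < s and 0 <= j < s and 0 <= k < r
    return (i << (rho + sigma)) + (j << rho) + k


def dimension_owner(d, rho, sigma):
    """Which part of the MCCC handles cube dimension E_d under this mapping."""
    if 0 <= d < rho:
        return "CCC module"
    if d < rho + sigma:
        return "mesh columns"
    if d < rho + 2 * sigma:
        return "mesh rows"
    raise ValueError("dimension out of range")


rho, sigma = 4, 2
print(cube_index(3, 1, 5, rho, sigma))
print([dimension_owner(d, rho, sigma) for d in range(rho + 2 * sigma)])
```

Because the three bit fields do not overlap, every cube processor is matched with exactly one micromodule.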
On the other hand the MCCC can be trivially laid out in a square of area $O(n^2/s^2)$,
since each CCC requires $O((n/s^2)^2)$ area and channels of $O(n/s^2)$ width
allow a straightforward implementation of mesh-connections. In conclusion,
for $s$ in the range $[\Omega(\log n), O(\sqrt{n/\log n})]$, considering that $\rho = O(\log n)$ and
that a word step takes $O(\log n)$ time, we obtain $T = O(s \log n)$ and $A = O(n^2/s^2)$,
which gives an optimal $AT^2$.
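As a check on the optimality claim, the time and area bounds of the MCCC can be multiplied out and the two ends of the range for s substituted. This is a restatement of the bounds stated in the text, with constant factors ignored:

```latex
\begin{align*}
T &= O(s \log n), \qquad A = O\!\left(\frac{n^2}{s^2}\right) \\
AT^2 &= O\!\left(\frac{n^2}{s^2} \cdot s^2 \log^2 n\right) = O(n^2 \log^2 n) \\
s = \log n &\;\Rightarrow\; T = O(\log^2 n), \qquad
s = \sqrt{n/\log n} \;\Rightarrow\; T = O\!\left(\sqrt{n \log n}\right)
\end{align*}
```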
The MCCC, used in the way just described, would not be optimal for the
execution of bitonic sorting. Bitonic sorting of $n = 2^\nu$ elements consists of
$\nu$ merging phases $M_0, M_1, \ldots, M_{\nu-1}$, with phase $M_i$ performing the merging of pairs
of sequences of length $2^i$, and requiring on the cube the successive use of
dimensions $E_i, \ldots, E_1, E_0$. So, the schedule of use of dimensions for a
complete sorting is
$\underbrace{E_0}_{M_0};\ \underbrace{E_1, E_0}_{M_1};\ \underbrace{E_2, E_1, E_0}_{M_2};\ \ldots;\ \underbrace{E_{\nu-1}, E_{\nu-2}, \ldots, E_0}_{M_{\nu-1}}$
For brevity, we shall call this schedule [3] the sorting paradigm. On the MCCC
the sorting paradigm requires $O(\rho \log n + s \log s)$ word steps, more than we desire.
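The schedule is easy to generate explicitly. The following sketch (the name `sorting_paradigm` is illustrative) lists, for each merging phase, the cube dimensions in the order they are used, and counts how often each dimension occurs.

```python
def sorting_paradigm(nu):
    """Dimension schedule of bitonic sorting on a nu-dimensional cube.

    Phase M_i uses dimensions E_i, ..., E_1, E_0 in that order.
    """
    return [list(range(i, -1, -1)) for i in range(nu)]


schedule = sorting_paradigm(4)
for phase, dims in enumerate(schedule):
    print(f"M_{phase}:", ", ".join(f"E_{d}" for d in dims))

# Dimension E_d is used once in every phase M_i with i >= d, i.e. nu - d times.
uses = [sum(d in dims for dims in schedule) for d in range(4)]
print(uses)
```

The whole schedule has nu(nu+1)/2 steps, and the low dimensions are used most often, which is why the order in which the topmost dimensions are assigned to the mesh matters for the total cost.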
We can eliminate the $\log s$ factor by a technique (already successful in the
construction of the Pleated-CCC) consisting of an alternate assignment of the
$2\sigma$ topmost dimensions to columns and rows. More precisely, if
$i = \sum_{l=0}^{\sigma-1} i_l 2^l$, $j = \sum_{l=0}^{\sigma-1} j_l 2^l$, $t' = \sum_{l=0}^{\sigma-1} (2 i_l + j_l) 2^{2l}$,
then we establish between MCCC micromodules and cube processors the correspondence
$M_{ij}(0,k) \leftrightarrow P_t$, $t = t' 2^{\rho} + k$.
For this correspondence a simple argument (similar to one given in [3])
shows that only $O(\rho \log n + s)$ word-steps are required for execution of the sorting
paradigm.
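The alternating assignment amounts to interleaving the bits of the row and column indices. A minimal sketch follows, with rho = log r and sigma = log s; the function names are hypothetical and the code only evaluates the index formula, not the emulation itself.

```python
def interleave(i, j, sigma):
    """Compute t' = sum over l of (2*i_l + j_l) * 2^(2l) for sigma-bit i and j."""
    t_prime = 0
    for l in range(sigma):
        i_l = (i >> l) & 1            # l-th bit of the row index
        j_l = (j >> l) & 1            # l-th bit of the column index
        t_prime += (2 * i_l + j_l) << (2 * l)
    return t_prime


def cube_index_interleaved(i, j, k, rho, sigma):
    """Index t = t' * 2^rho + k of the processor matched with M_ij(0, k)."""
    return (interleave(i, j, sigma) << rho) + k


print(interleave(0b10, 0b01, sigma=2))
print(cube_index_interleaved(2, 1, 5, rho=4, sigma=2))
```

Under this mapping the cube dimensions above the CCC range alternate between column bits and row bits, instead of all column bits followed by all row bits.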
Again, for $s \in [\Omega(\log^2 n), O(\sqrt{n/\log n})]$, recalling that $O(\log n)$ time is used
for a word step, we obtain $T = O(s \log n)$, and $A = O(n^2/s^2)$. Although our main
purpose in defining the MCCC is to construct optimal sorters for
$T \in [\Omega(\log n), O(\log^3 n)]$, we have seen that the MCCC is an
optimal emulator of the cube for both the ASCEND and the SORTING paradigms.
Let us also point out another interesting feature of the MCCC, namely that
the maximum edge-length in the layout is $O(n/s^2)$. For $s = \Theta(\log n)$ we obtain a
maximum edge-length of $O(n/\log^2 n)$, which is optimal.(3) In fact [12]
max edge-length $= \Omega(\sqrt{\text{optimal area}}/\text{diameter})$ for any graph, and for the MCCC
optimal area $= \Theta(n^2/\log^2 n)$, and diameter $= \Theta(\log n)$. It is also interesting
to recall that the optimal layouts known for the CCC and the Shuffle-Exchange
contain edges of length $O(n/\log n)$.
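The optimality of the edge length for s proportional to log n can be checked by substituting the area and diameter quoted above into the lower bound of [12], ignoring constant factors:

```latex
\[
\frac{\sqrt{\text{optimal area}}}{\text{diameter}}
= \frac{\sqrt{n^2/\log^2 n}}{\log n}
= \frac{n/\log n}{\log n}
= \frac{n}{\log^2 n}
\]
```

This matches the maximum edge-length of the MCCC layout for that choice of s, and is a factor of log n shorter than the longest edges in the CCC and Shuffle-Exchange layouts mentioned above.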
To obtain networks faster than the MCCC we start from the following
observation. A COMBINE-sorter with $n/s$ inputs can sort (in time $O(s \log n)$
and area $O(n^2/s^2)$) $s$ sequences of $n/s$ elements each. These sequences can
then be fed, say one per column, into an MCCC with parameter $s$. The sequence
in each CCC module is at this point already sorted, and the MCCC is ready
(after inverting the order of some sequences to comply with bitonic sorting
rules) to execute the last $2\sigma$ merging phases. (For the sake of simplicity we
will ignore the fact that only $\sigma$ phases would be really necessary after the
work done by the COMBINE-sorter.) A simple analysis allows us to conclude that,
in the process, the MCCC executes $O(\log s + s)$ steps using $O(\log n)$ time each,
thus running for a total time $T = O(s \log n)$.
In conclusion, when $s \in [\Omega(1), O(\log^2 n)]$, the computation time $T$ of the
entire machine ranges in $[\Omega(\log n), O(\log^3 n)]$, and for each $T$ the layout
area is optimally $O(n^2 \log^2 n/T^2)$.
(3) This property has also been noted by A. Aggarwal for an architecture very
similar to the MCCC (private communication).
REFERENCES
1. C. D. Thompson, "The VLSI complexity of sorting," to appear.
2. C. D. Thompson, A Complexity Theory for VLSI, Ph.D. Thesis, Computer 
Science Department, Carnegie-Mellon Univ., Aug. 1980.
3. G. Bilardi, F. P. Preparata, "A VLSI optimal architecture for bitonic 
sorting," Proc. 7th Conf. on Information Sciences and Systems, The 
Johns Hopkins University, Baltimore, MD, (March 1983); pp. 1-5.
4. K. E. Batcher, "Sorting networks and their applications," Proc. AFIPS
Spring Joint Computer Conference, vol. 32, pp. 307-314, April 1968.
5. D. D. Nath, S. N. Maheshwari, and P. C. P. Bhatt, "Efficient VLSI
networks for parallel processing based on orthogonal trees," IEEE
Trans. Comput., vol. C-32, no. 6, pp. 569-581, June 1983.
6. F. T. Leighton, "New lower bound techniques for VLSI," Proc. 22nd Symp.
on the Foundations of Computer Science, IEEE Computer Society, Oct.
1981.
7. D. E. Muller, F. P. Preparata, "Bounds to complexities of networks for
sorting and for switching," JACM, vol. 22, pp. 195-201, April 1975.
8. F. P. Preparata, "New parallel sorting schemes," IEEE Trans. Comput.,
vol. C-27, no. 7, pp. 669-673, July 1978.
9. M. C. Pease, "The indirect binary n-cube microprocessor array," IEEE
Trans. Comput., vol. C-26, no. 5, pp. 458-473, May 1977.
10. F. P. Preparata, J. Vuillemin, "The cube-connected cycles: A versatile
network for parallel computation," Comm. of the ACM, vol. 24, no. 5,
pp. 300-309, May 1981.
11. D. Nassimi, S. Sahni, "Parallel permutation and sorting algorithms and
a new generalized connection network," JACM, vol. 29, no. 3,
pp. 642-667, July 1982.
12. F. T. Leighton, Layouts for the Shuffle-Exchange Graph and Lower Bound
Techniques, Ph.D. Thesis, Dept. of Mathematics, MIT, August 1981.
13. M. Ajtai, J. Komlós, E. Szemerédi, "An O(n log n) sorting network," Proc.
15th SIGACT, Boston, MA, April 1983, pp. 1-9.
