The VLSI Optimality of the AKS Sorting Network by Bilardi, Gianfranco & Preparata, Franco P.
ACT-46 FEBRUARY 1984
COORDINATED SCIENCE LABORATORY
APPLIED COMPUTATION THEORY GROUP
THE VLSI OPTIMALITY OF THE 
AKS SORTING NETWORK
BILARDI, GIANFRANCO
PREPARATA, FRANCO P.
APPROVED FOR PUBLIC RELEASE. DISTRIBUTION UNLIMITED.
REPORT R-1008 UILU-ENG 84-2202
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
REPORT DOCUMENTATION PAGE
Security classification of this page: Unclassified
1a. Report security classification: Unclassified
1b. Restrictive markings: None
2a. Security classification authority: N/A
2b. Declassification/downgrading schedule: N/A
3. Distribution/availability of report: Approved for public release, distribution unlimited.
4. Performing organization report number(s): R-1008; UILU-ENG 84-2202; ACT-46
5. Monitoring organization report number(s): N/A
6a. Name of performing organization: Coordinated Science Laboratory, Univ. of Illinois
6b. Office symbol (if applicable): N/A
6c. Address (City, State and ZIP Code): 1101 W. Springfield Avenue, Urbana, IL 61801
7a. Name of monitoring organization: Joint Services Electronics Program
7b. Address (City, State and ZIP Code): 800 N. Quincy Street, Arlington, VA
8a. Name of funding/sponsoring organization: Joint Services Electronics Program
8b. Office symbol (if applicable): N/A
8c. Address (City, State and ZIP Code): 800 N. Quincy St., Arlington, VA
9. Procurement instrument identification number: Contract N00014-79-C-0424
10. Source of funding nos. (program element, project, task, work unit): N/A, N/A, N/A, N/A
11. Title: The VLSI Optimality of the AKS Sorting Network
12. Personal author(s): Bilardi, Gianfranco and Preparata, Franco P.
14. Date of report (Yr., Mo., Day): February 1984
15. Page count: 11
16. Supplementary notation: N/A
18. Subject terms: Sorting networks, VLSI complexity, optimal VLSI networks
A VLSI implementation is given for the sorting network proposed by Ajtai, Komlos, and 
Szemeredi, which can be laid out in $O(n^2)$ area and works in O(logn) time. This performance 
is optimal under the (synchronous) VLSI model of computation.
20. Distribution/availability of abstract: Unclassified/Unlimited
21. Abstract security classification: Unclassified
22c. Office symbol: NONE
DD FORM 1473, 83 APR. EDITION OF 1 JAN 73 IS OBSOLETE.
THE VLSI OPTIMALITY OF THE AKS SORTING NETWORK
G. Bilardi and F. P. Preparata 
Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign
Introduction
Ajtai, Komlos, and Szemeredi [1] recently proposed a sorting network 
(referred to hereafter as the AKS network), of O(nlogn) comparators and 
O(logn) depth. Their construction is of great theoretical interest, for 
it shows that O(nlogn) comparisons suffice to sort n elements, even under 
the constraint that comparisons be nonadaptively executed in O(logn) parallel 
stages. At present, the AKS network does not appear suitable for practical 
implementations, due to the large value of the constants; however, improvements
are conceivable that could make the network more attractive for 
real-world applications.
It is therefore natural to ask what is the performance of the AKS network 
in the synchronous VLSI model of computation which has been proposed [2] to 
capture the essential features of planar very large scale integration as a 
computing environment.
In this model it is known that any chip capable of sorting n words of
length $q = (1+\alpha)\log n$, with $\alpha > 0$, must satisfy the relationship $AT^2 = \Omega(n^2 \log^2 n)$,
where A is the chip area, and T is the computation time. This lower bound 
was originally obtained by Thompson [2] under the word-local restriction 
(all the bits of the same word enter the circuit at the same input port).
Recently Leighton [3] has shown that the lower bound remains valid even for 
non-word-local designs.
This work has been supported in part by the Joint Services Electronics Program 
under Contract N00014-79-C-0424 and by the IBM Predoctoral Fellowship Program.
Many designs of VLSI sorters have already been proposed (see Thompson
[4] for a survey). We mention here the ones that achieve minimum area 
$A = O(n^2 \log^2 n / T^2)$ at their computation time T:
- the mesh-connected [1,5,6] bitonic sorter [7], for $T = O(\sqrt{n})$.
- the pleated-cube-connected-cycles (PCCC) [8] also implementing 
bitonic sorting for T in the range $[\Omega(\log n), O(\sqrt{n \log n})]$.
- a hybrid architecture based on the cube-connected-cycles and the 
orthogonal trees interconnections [9], which implements the 
enumeration sorting schemes of [10] , and works in minimum computation 
time T = O(logn).
- a hybrid architecture consisting of orthogonal trees and permuter 
networks [3], which implements a generalization of the even-odd sort 
[7], and also works in time T = O(logn).
It is then interesting to see how the AKS algorithm, which is radically 
different from any other known sorting paradigm, compares with more classical
methods in the VLSI environment, where the heaviest demand of resources 
usually comes from communication, rather than from computing requirements, so 
that a small number of processing elements does not necessarily imply a good 
performance.
In this note we show that the AKS sorting network can indeed be laid out 
in area $A = O(n^2)$, while maintaining an O(logn) computation time, thereby 
establishing its optimality in the VLSI model of computation.
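The optimality claim can be checked in one line by combining the upper bounds of this note with the lower bound quoted in the Introduction. The following is only a restatement of those bounds, written out as a worked step:

```latex
A\,T^{2} \;=\; O(n^{2}) \cdot O(\log n)^{2} \;=\; O(n^{2}\log^{2} n),
\qquad\text{whereas any sorter must satisfy}\qquad
A\,T^{2} \;=\; \Omega(n^{2}\log^{2} n).
```

Since the two bounds agree up to a constant factor, no layout can improve the area at this computation time by more than a constant.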
Layout of the AKS Network
The original description [1] of the AKS network (with n inputs) is given 
in terms of an n-node graph G = (V,E), whose nodes are registers, and whose 
edges are comparators. The set of edges E is partitioned as $E = E_1 \cup E_2 \cup \dots \cup E_N$,
where each of the $E_s$'s is a (possibly partial) matching on V, and $N < \beta \log n$ for 
some (very large) constant $\beta$. Since each $E_s$ ($s = 1,\dots,N$) is a (possibly 
partial) matching, all of its comparators can be simultaneously active. Thus 
the AKS sorting algorithm can be described as follows: 
begin for s := 1 to N do
        for all (x,y) ∈ E_s with x < y pardo
            (R(x),R(y)) := (min(R(x),R(y)), max(R(x),R(y)))
end
where R(x) is the content of the register associated with node x.
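The loop above can be simulated sequentially. What follows is a minimal Python sketch, assuming each stage is supplied as a list of index pairs. The name `run_comparator_network` and the 4-input example network are illustrative assumptions; they are not the AKS matchings, whose construction is given in [1].

```python
from typing import List, Sequence, Tuple

Matching = Sequence[Tuple[int, int]]  # pairs (x, y) of 0-based register indices


def run_comparator_network(values: Sequence[int],
                           stages: Sequence[Matching]) -> List[int]:
    """Apply the matchings E_1, ..., E_N in order to the register contents."""
    r = list(values)
    for stage in stages:
        touched = set()
        for x, y in stage:
            # In a matching no register appears twice, so the pairs of one
            # stage are independent; this is what "pardo" relies on.
            assert x not in touched and y not in touched, "stage is not a matching"
            touched.update((x, y))
            lo, hi = min(x, y), max(x, y)
            r[lo], r[hi] = min(r[lo], r[hi]), max(r[lo], r[hi])
    return r


# Hypothetical example: a standard 4-input sorting network with 3 stages.
EXAMPLE_STAGES = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
```

The sequential inner loop gives the same result as parallel execution because the comparators of one stage touch disjoint registers.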
Since the embedding of a graph in a planar grid requires nodes of bounded 
degree, we shall modify the original description as follows. According to a 
scheme described by Knuth [11], we consider n lines that run parallel, say,
to the horizontal axis. On line r ($r = 1,2,\dots,n$) there will be N processors
$P[r,1],\dots,P[r,N]$, whose capabilities will be specified below. For each 
$s = 1,2,\dots,N$, and for each $(x,y) \in E_s$, we connect processors P[x,s] and 
P[y,s] by a vertical line. This vertical line supports the execution of the 
comparison-exchange (R(x),R(y)) := (min(R(x),R(y)),max(R(x),R(y))), where 
R(x) and R(y) are respectively the operands stored in P[x,s] and P[y,s].
Once the comparison-exchanges specified by $E_s$ have been executed, the 
results will be forwarded on each line (that is, from P[x,s] to P[x,s+1], 
$x = 1,\dots,n$).
This basic layout can be further specified by selecting the degree of 
parallelism of the operand transmission. Due to the amenability to 
pipelined operation, the q-bit operands are fed in bit-serial fashion 
starting with the most significant bit and each processor is equipped with 
a serial comparator. In each comparator, as long as the two inputs agree, 
they are transmitted to the next processor on the same line. As soon as a 
bit discrepancy is detected, a switch is set and, from then on, the remaining 
substrings of each of the operands will follow a fixed path independently 
of their value.
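The behaviour of such a serial comparator can be sketched in a few lines of Python. This is a minimal software model under stated assumptions: operands are equal-length lists of bits, most significant bit first, and the function name `serial_compare_exchange` and the three-valued `switch` variable are illustrative, not taken from the report.

```python
from typing import Iterable, List, Tuple


def serial_compare_exchange(a_bits: Iterable[int],
                            b_bits: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (min_bits, max_bits) for two MSB-first bit streams."""
    switch = None  # None: inputs agree so far; False: straight; True: crossed
    lo_out: List[int] = []
    hi_out: List[int] = []
    for a, b in zip(a_bits, b_bits):
        if switch is None and a != b:
            # First discrepancy: the operand carrying the 1 is the larger one,
            # so the paths are crossed exactly when that operand is a.
            switch = a > b
        if switch:
            lo_out.append(b)
            hi_out.append(a)
        else:
            # Either the switch is set to "straight" or the bits still agree.
            lo_out.append(a)
            hi_out.append(b)
    return lo_out, hi_out
```

Each output bit depends only on the current input bits and the switch state, so a bit can leave the processor as soon as it has arrived. This is the property that allows successive stages to be pipelined.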
Thus we have ensured that the AKS network works in $T = O(\log n + q) = O(\log n)$ 
time, and we turn our attention to the layout area. We first observe that 
both the horizontal and the vertical lines are of $O(1)$ width. It is then 
simple to conclude that the height of the entire layout is $O(n)$. On the 
other hand, any matching of n lines can be easily laid out in (at most) n/2 
vertical tracks of constant width, by using a track for each edge of the
matching. Since there are $N = O(\log n)$ matchings to be cascaded in the AKS
network, it is readily proved that $O(n \log n)$ width, and therefore $O(n^2 \log n)$
area, suffices for the layout. A closer analysis, however, reveals that many
of the matchings $E_1,\dots,E_N$ are such that many edges can be laid out, without
overlap, in the same vertical track, yielding the conclusion that the bound
for the area can be lowered to $O(n^2)$.
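The remark that several edges may share a vertical track can be illustrated with a small sketch. It uses greedy interval partitioning, a standard technique chosen here for illustration rather than one described in the report; the name `pack_tracks` is hypothetical. Two edges may share a track when the intervals of lines they cover are disjoint.

```python
import heapq
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]  # comparator between lines x and y


def pack_tracks(edges: Sequence[Edge]) -> List[List[Edge]]:
    """Assign edges to vertical tracks so that edges sharing a track
    cover disjoint intervals of lines."""
    tracks: List[List[Edge]] = []
    free_at: List[Tuple[int, int]] = []  # heap of (last occupied line, track index)
    for x, y in sorted((min(e), max(e)) for e in edges):
        if free_at and free_at[0][0] < x:
            _, k = heapq.heappop(free_at)  # reuse the track that frees up first
        else:
            k = len(tracks)                # no track is free: open a new one
            tracks.append([])
        tracks[k].append((x, y))
        heapq.heappush(free_at, (y, k))
    return tracks
```

For a matching whose edges all cross one another, n/2 tracks are still needed; the saving appears only when the edges are local, as in the lower levels of the hierarchy analyzed next.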
To establish this claim we introduce the following description 
of the layout of the AKS network. The layout can be analyzed as the assembly 
of suitable simpler building blocks, whose hierarchy is illustrated in Figure 1.
Each of these building blocks will now be described in detail, in a top-down
fashion.
[Figure 1: depth labels of the successive levels: $1 + 3\log n$; 1; $\log(1/\eta)$; c.]
Figure 1. Hierarchy of building blocks of the AKS network. The depth is 
expressed as the length of the cascade of blocks of the
immediately lower level.
(1) The AKS network on $n = 2^d$ inputs is the cascade of (1+3d) stages, 
called cherry stages, and denoted by $S_0, S_{1,1}, S_{1,2}, S_{1,3}, \dots, S_{d,1}, S_{d,2}, S_{d,3}$
(Figure 2).
Figure 2. The AKS network on n inputs is the cascade of (1+3d) cherry stages.
(2) To each cherry stage $S_{t,h}$ ($t = 1,\dots,d$; $h = 1,2,3$) there corresponds a
partition $P_{t,h}$ of the integers (lines) $1,2,\dots,n$. Although the assignment
of the integers to the partition blocks is too complicated to be repeated
here (the reader is referred to [1]), what is important now are the
properties of $P_{t,h}$ that are relevant to the layout. Specifically, $P_{t,h}$
consists of the following (disjoint) blocks:
$$P_{t,1} = P_{t,3} = \{T_t(2i,j) : i = 0,1,\dots,\lfloor (t-1)/2 \rfloor;\ j = 1,2,\dots,2^{2i}\}$$
$$P_{t,2} = \{T_t(2i-1,j) : i = 1,2,\dots,\lfloor t/2 \rfloor;\ j = 1,2,\dots,2^{2i-1}\} \cup \{T_t(-1,0)\}.$$
To stage $S_0$ there corresponds the trivial partition $P_0$ consisting of one
block only.
If we now define as span(T) the smallest interval of $\{1,\dots,n\}$ containing 
$T \subseteq \{1,\dots,n\}$, we have the following properties:
(1) For given t and i, and $j' \neq j$, $\mathrm{span}(T_t(i,j)) \cap \mathrm{span}(T_t(i,j')) = \emptyset$.
(2) $|\mathrm{span}(T_t(i,j))| \leq n/2^i$ for every t and j.
(3) $|T_t(i,j)| \leq \gamma\,(n/2^i)\,A^{i-t}$ for every j, where $\gamma$ and A = 2a > 1 are 
constants.
The lines numbered by the integers in a block $T_t(i,j)$ are involved in a 
network of comparators called an $\eta$-nearsorter (see Figure 3). Properties (1) and 
(2) show that for any fixed t and i, all $\eta$-nearsorters corresponding to 
$\{T_t(i,j) : j = 1,2,\dots,2^i\}$ can be laid out in the same vertical strip,
as shown in Figure 3. Moreover, all nearsorters in the same cherry stage 
can operate in parallel (indeed, no two share a line).
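The role of the span in this argument can be made concrete with a short sketch. It assumes blocks are given as collections of line numbers; the names `span` and `spans_are_disjoint` are illustrative, and the example blocks in the usage note are made up rather than taken from the AKS partition.

```python
from typing import Iterable, List, Tuple


def span(block: Iterable[int]) -> Tuple[int, int]:
    """Smallest interval (lo, hi) of line numbers containing the block."""
    items = list(block)
    return min(items), max(items)


def spans_are_disjoint(blocks: Iterable[Iterable[int]]) -> bool:
    """True if no two spans overlap, so the blocks can be stacked
    one above the other inside a single vertical strip."""
    intervals: List[Tuple[int, int]] = sorted(span(b) for b in blocks)
    return all(prev[1] < cur[0] for prev, cur in zip(intervals, intervals[1:]))
```

For instance, the hypothetical blocks {1, 3}, {5, 8} and {9, 12} have disjoint spans and fit in one strip, whereas {1, 6} and {4, 8} do not, even though the two sets themselves are disjoint.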
Figure 3. Typical cherry stages $S_{t,1}$ and $S_{t,2}$ (t is even in the figure).
The region labelled $T_t(i,j)$ corresponds to the layout of an 
$\eta$-nearsorter.
(3) An $\eta$-NEARSORTER, corresponding to block $T_t(i,j)$, has the structure 
of a full binary tree of depth $\log(1/\eta)$. Each node of this tree is a
network of comparators, called an $\varepsilon$-HALVER (see (4)), encompassing an 
interval of lines (Figure 4). If $m = |T_t(i,j)|$, then the root encompasses 
m lines; if a node v of the tree encompasses s lines, then 
its two offspring each encompass (approximately) s/2 lines.
Figure 4. An $\eta$-NEARSORTER is a full binary tree of $\varepsilon$-HALVERS.
(4) An $\varepsilon$-HALVER stage on m lines (with $\varepsilon < \eta/\log(1/\eta)$) consists of the 
cascade of c (where c is a function of $\varepsilon$, but is independent of m) 
one-factor stages (matching stages). (When the network is viewed as 
a graph G = (V,E), i.e. when each line is shrunk to a single node, the 
$\varepsilon$-HALVER becomes an expander graph on the set of nodes on which its 
edges are incident.) (See Figure 5.)
Figure 5. An $\varepsilon$-HALVER is a cascade of a constant number of one-factors.
(5) Finally, a one-factor stage on m lines is a matching between the lower 
and the upper half of these lines, and it is a subset of exactly one 
of the sets $\{E_s : s = 1,\dots,N\}$ introduced earlier. (See Figure 6.)
Figure 6. A one-factor is a matching between the top and the bottom half 
of lines.
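How a cascade of one-factors moves small values to the top half can be sketched as follows. This is a toy model under loud assumptions: the matchings are cyclic shifts (top line i against bottom line (i + shift) mod m/2), which are a stand-in and not the expander-based matchings of [1], and the names `apply_one_factor` and `misplaced_in_top` are hypothetical. Input values are assumed distinct and m even.

```python
from typing import List, Sequence


def apply_one_factor(values: Sequence[int], shift: int) -> List[int]:
    """One matching stage: compare top line i with bottom line
    (i + shift) mod h, where h = m/2; the smaller value goes to the top."""
    h = len(values) // 2
    out = list(values)
    for i in range(h):
        j = h + (i + shift) % h
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def misplaced_in_top(values: Sequence[int]) -> int:
    """Number of top-half lines holding one of the m/2 largest values."""
    h = len(values) // 2
    threshold = sorted(values)[h]  # smallest member of the larger half
    return sum(1 for v in values[:h] if v >= threshold)
```

With all m/2 shifts every top line meets every bottom line, so the halving becomes exact; the point of an ε-halver is to get within a fraction ε of this using only a constant number c of well-chosen matchings.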
Now we proceed, bottom-up, to analyze the area of the network.
(i) A one-factor stage on m lines can be laid out in $O(m)$ length, by
allocating a vertical track for each of the m/2 edges. The height
of the layout will be proportional to the distance between the 
topmost and the bottommost of the input lines.
(ii) An $\varepsilon$-HALVER has a length of $O(cm)$; c is the valence of the $\varepsilon$-HALVER.
(iii) An $\eta$-NEARSORTER has a length also of $O(cm)$, since the length
of the $\varepsilon$-HALVERS decreases geometrically with the level.
(iv) We now subdivide the layout into vertical slabs, with slab(t,i) 
containing the nearsorters on sets $T_t(i,j)$ for all suitable values 
of j. (There are in fact two identical copies of $T_t(i,j)$ when i is 
even, but this will only affect constant factors.) From point (iii) 
and property (3) it immediately follows that
$$\ell(t,i) = \text{length of slab}(t,i) \leq \gamma\, n\, 2^{-i} A^{i-t}.$$
Then, the total length $\ell$ can be obtained by summing $\ell(t,i)$ over all the 
vertical slabs:
$$\ell = \sum_{t=0}^{d} \sum_{i=0}^{t} \ell(t,i) = \sum_{i=0}^{d} \sum_{t=i}^{d} \ell(t,i) \leq \gamma\, n \sum_{i=0}^{d} 2^{-i} \sum_{t=i}^{d} (1/A)^{t-i} \leq \frac{2\gamma}{1 - 1/A}\, n.$$
In conclusion $A = \text{height} \times \text{length} = O(n) \times O(n) = O(n^2)$, as claimed.
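The double sum of point (iv) can be checked numerically. The sketch below assumes the slab bound $\gamma\, n\, 2^{-i} A^{i-t}$ with $n = 2^d$ and compares the sum over all slabs with the closed form $2\gamma n/(1 - 1/A)$. The function names are hypothetical, and the values of gamma and A used in the usage note are arbitrary illustrative choices, not the constants of [1].

```python
def slab_length_bound(n: int, t: int, i: int, gamma: float, A: float) -> float:
    """Upper bound on the length of slab(t, i): gamma * n * 2^(-i) * A^(i - t)."""
    return gamma * n * 2.0 ** (-i) * A ** (i - t)


def total_length_bound(d: int, gamma: float, A: float) -> float:
    """Sum of the slab bounds over t = 0..d and i = 0..t, with n = 2^d."""
    n = 2 ** d
    return sum(slab_length_bound(n, t, i, gamma, A)
               for t in range(d + 1)
               for i in range(t + 1))


def closed_form_bound(d: int, gamma: float, A: float) -> float:
    """Closed form 2 * gamma * n / (1 - 1/A) obtained from two geometric series."""
    return 2.0 * gamma * (2 ** d) / (1.0 - 1.0 / A)
```

For example, with gamma = 1 and A = 2 the closed form equals 4n, and `total_length_bound(d, 1.0, 2.0)` stays below it for every d, which is the linear growth in n that the area bound relies on.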
References
1. M. Ajtai, J. Komlos, E. Szemeredi, "An O(NlogN) Sorting Network," Proc. 
15th SIGACT, Boston, MA, April 1983, pp. 1-9.
2. C. D. Thompson, A Complexity Theory for VLSI, Ph.D. Thesis, Computer 
Science Department, Carnegie-Mellon Univ., Aug. 1980.
3. F. T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," 
Proc. 16th SIGACT, Washington, D.C., April 1984.
4. C. D. Thompson, "The VLSI Complexity of Sorting," IEEE Trans. Comp., 
vol. C-32, no. 12, Dec. 1983.
5. C. D. Thompson and H. T. Kung, "Sorting on a Mesh Connected Computer," 
Comm. of ACM, vol. 20, no. 4, pp. 263-271, April 1977.
6. D. Nassimi and S. Sahni, "Bitonic Sort on a Mesh-Connected Parallel 
Computer," IEEE Trans. on Computers, vol. C-28, no. 1, pp. 2-7,
Jan. 1979.
7. K. E. Batcher, "Sorting Networks and Their Applications," Proc. AFIPS 
Spring Joint Computer Conference, vol. 32, pp. 307-314, April 1968.
8. G. Bilardi, F. P. Preparata, "A VLSI Optimal-Architecture for Bitonic 
Sorting," Proc. 7th Conf. on Information Sciences and Systems, The 
Johns Hopkins University, Baltimore, MD, March 1983, pp. 1-5.
9. G. Bilardi, F. P. Preparata, "A Minimum Area VLSI Architecture for 
O(logn) Time Sorting," Proc. 16th SIGACT, Washington, D.C., April 1984.
10. F. P. Preparata, "New Parallel Sorting Schemes," IEEE Trans. Comput., 
vol. C-27, no. 7, pp. 669-673, July 1978.
11. D. E. Knuth, The Art of Computer Programming: Sorting and Searching, 
Vol. 3, Reading, MA: Addison-Wesley, 1973.
Keywords: VLSI complexity, area-time trade-off, sorting networks,
optimal algorithms, parallel computation.
