IEEE TRANSACTIONS ON COMPUTERS, VOL. C-32, NO. 12, DECEMBER 1983
The VLSI Complexity of Sorting
CLARK D. THOMPSON
Abstract-The area-time complexity of sorting is analyzed under
an updated model of VLSI computation. The new model makes a
distinction between "processing" circuits and "memory" circuits; the
latter are less important since they are denser and consume less power.
Other adjustments to the model make it possible to compare pipelined
and nonpipelined designs.
Using the new model, this paper briefly describes thirteen different
designs for VLSI sorters. (None of these sorters is new, but few have
been laid out or analyzed in a VLSI model.) The thirteen sorting
circuits are used to document the existence of an area * time^2 tradeoff
for the sorting problem. The smallest circuit is only large enough to
store a few elements at a time; it is, of course, rather slow at sorting.
The largest design solves an N-element sorting problem in only
O(lg N) clock cycles. The area * time^2 performance figure for all but
three of the designs is close to the theoretical minimum value
Ω(N^2 lg N).
Index Terms-Area-time complexity, bitonic sort, bubble sort,
heapsort, mesh-connected computers, parallel algorithms, shuffle-
exchange network, sorting, VLSI, VLSI sorter.
I. INTRODUCTION
SORTING has attracted a great deal of attention over the
past few decades of computer science research. It is easy
to see why: sorting is a theoretically interesting problem with
a great deal of practical significance. As many as a quarter of
the world's computing cycles were once devoted to sorting [19,
p. 3]. This is probably no longer the case, given the large
number of microprocessors running dedicated control tasks.
Nonetheless, sorting and other information-shuffling
techniques are of great importance in the rapidly growing database
industry.
The sorting problem can be defined as the rearrangement
of N input values so that they are in ascending order. This
paper examines the complexity of the sorting problem,
assuming it is to be solved on a VLSI chip. Much is already
known about sorting on other types of computational structures
[19, pp. 1-388], and much of this knowledge is applicable to
VLSI sorting. However, VLSI is a novel computing medium
in at least one respect: the size of a circuit is determined as
much by its intergate wiring as by its gates themselves. This
technological novelty makes it appropriate to reevaluate
sorting circuits and algorithms in the context of a "VLSI model
of computation."
Using a VLSI model, it is possible to demonstrate the
existence of an area * time^2 tradeoff for sorting circuits. A
preliminary study of this tradeoff is contained in the author's
Ph.D. dissertation [42], in which two sorting circuits were
analyzed. This paper analyzes eleven additional designs under
an updated model of VLSI computation. The updated model
has the advantage of allowing fair comparisons between
pipelined and nonpipelined designs.
Manuscript received March 15, 1982; revised June 10, 1982. This work was
supported by the National Science Foundation under Grant ECS-81-10684.
The author is with the Computer Science Division, University of California,
Berkeley, CA 94720.
None of the sorting circuits in this paper is new, since all are
based on commonly known serial algorithms. All have been
proposed before for hardware implementation. However, this
is the first time that most of these circuits have been analyzed
for their area and time complexity in a VLSI implementation.
Ten of the sorters will be seen to have an area * time^2
performance in the range O(N^2 lg^2 N) to O(N^2 lg^5 N). Since it is
impossible for any design to have an area * time^2 product of
less than Ω(N^2 lg N) [44], these designs are area- and
time-optimal to within logarithmic factors (see footnote 1).
A number of different models for VLSI have been proposed
in the past few years [5], [7], [16], [25], [37], [42], [43], [46].
They differ chiefly in their treatment of chip I/O, placing
various restrictions on the way in which a chip accesses its
input. Typically, each input value must enter the chip at only
one place [42] or at only one time and place [5]. Savage [36]
has characterized these as the "unilocal" and "semelective"
assumptions, respectively.
The model of this paper builds on its predecessors, removing
as many restrictions on chip I/O as possible. Following Kedem
and Zorat [16], the unilocal assumption could be relaxed by
allowing a chip to access each input value from k different I/O
memories. Kedem's proof suggests that clever use of such
"multilocal" inputs could improve a chip's area * time^2
performance by a factor of k^2. Unfortunately, there seem to be
neither interesting sorting circuits that take advantage of
multilocal data, nor examples of naturally occurring multilocal
inputs. Thus, the new model retains the unilocal assumption.
The semelective assumption is much less justifiable than the
unilocal assumption. It is perfectly feasible to design a chip that
makes multiple accesses to problem inputs, outputs, and
intermediate results contained in off-chip memory. In a break
from previous practice in theoretical VLSI models, the area
of this off-chip memory is not included in the total area of a
chip. This serves to clarify the area * time^2 tradeoff for sorting
circuits; memory area seems to be involved in a (lg area) * time
tradeoff, at least for circuits with fixed I/O bandwidth and
small amounts of memory [13]. Leaving memory area out of
the new model permits the analysis of sublinear size circuits.
Footnote 1: Knuth's notation for the base-two logarithm, lg ≜ log_2, is used
throughout this paper. See [20] for standard definitions of order notation
for lower (Ω( )), exact (Θ( )), and upper (O( )) bounds.
0018-9340/83/1200-1171$01.00 © 1983 IEEE
It also makes the model's area measure more sensitive to the
power consumption of a circuit, since memory cells have a low
duty cycle and generally consume much less power per unit
area than do "processing" circuits.
Other authors have used nonsemelective models, although
none has elaborated quite so much on the idea. Lipton and
Sedgewick [25] point out that the "standard" AT^2 lower
bound proofs do not depend on semelective assumptions. Hong
[14] defines a nonsemelective model of VLSI with a space-time
behavior which is polynomially equivalent to that of eleven
other models of computation. His equivalence proofs depend
upon the fact that VLSI wiring rules can cause at most a
quadratic increase in the size of a zero-width-wire circuit.
Unfortunately, Hong's transformation does not necessarily
generate optimal VLSI circuits from optimal zero-width-wire
circuits, since a quadratic factor cannot be ignored when
"easy" functions like sorting are being studied.
Lipton and Sedgewick [25] point out another form of I/O
restriction, one that is not removed in this paper's model. In
most situations it is natural to restrict one's attention to circuits
which produce their outputs at fixed locations. For example,
the most significant bit of the largest output value in a
"where-oblivious" circuit might be constrained to appear at
I/O port 1, regardless of the problem inputs.
Adoption of the where-oblivious restriction raises an
important theoretical question: what does "sorting" mean? Is it
determining the rank order of the inputs? Is it permuting the
inputs into sorted order, given their ranks? Or does it involve
both ranking and permuting? The where-oblivious restriction
adopted above implies an affirmative answer to the last
question.
From a practical point of view, a sorting circuit should be
required to rank and permute its inputs. It is possible to
conceive of uses for circuits that can only rank or can only
permute, but most applications require both. For example,
consider the problem of removing the duplicate records that are
frequently produced by projection operations on a relational
database. The duplicates can be detected and purged in a
straightforward fashion once the records are permuted into
rank order (on any remaining key), thereby saving space and
time in later database operations.
A historical argument can also be made in favor of requiring
a "sorting" circuit to rank and permute its data. The original
meaning of "sorting" is "separating or arranging according
to class or kind" [19, p. 1], a process that would seem to involve
both classification and movement. Thus, for practical and
historical reasons, the interesting study of "ranking circuits"
is left to other papers and other authors. Only (ranking and
permuting) = "sorting" circuits are studied here.
The catalog of I/O restrictions is not yet complete. In both
Vuillemin's [46] and Thompson's [43] models of pipelined
VLSI computation, analogous inputs and outputs for different
problems must be accessed through identical I/O ports. For
example, input 1 of problem 2 must enter the chip at the same
place as input 1 of problem 1. While this seems to be a natural
assumption for a pipelined chip, it leads to a number of
misleading conclusions about the optimality of highly concurrent
designs. For instance, the highly parallelized bubble sort design
of Section III-L is nearly area * time^2 optimal under the old
models, but it is significantly suboptimal under the model of
this paper.
When the restriction on pipelined chip inputs is removed,
it becomes impossible to prove an Ω(N^2 lg N) lower bound on
area * time^2 performance until the definitions of area and time
are adjusted.
In the new model, the area performance A of a design is its
"area per problem," equal to its total processing area A_processing
divided by its degree of concurrency c. Thus, it does not matter
how many copies of a chip are being considered as a single
design: doubling the number of chips doubles both its
concurrency and its total area, leaving its area performance
invariant. The old definition of area performance was the total
area of a design (including its "memory area") with no
correction factor for its concurrency.
The time performance of a design can be measured in a
number of ways. Vuillemin [46] concentrates on the data rate
D (or I/O bandwidth), a measure of a circuit's throughput.
While this is an important parameter, a circuit's data rate is
not a very useful definition of time performance under this
paper's unrestricted I/O model. A design consisting of two
identical sorting chips would have twice the data rate of either
of its chips considered separately. As with the area measure
discussed above, it would be possible to "normalize" a circuit's
data rate by dividing by its concurrency. The resulting measure
defies easy interpretation; fortunately, a better measure is
available.
Vuillemin [46] also considers the period T_p between
successive presentations of complete sets of problem inputs. This
measure is closely related to a circuit's data rate D, for it is
numerically equal to the problem size (in bits) divided by D.
As such, it suffers from similar difficulties of interpretation
when inputs are not assumed to form a single stream.
The time measure adopted here is the delay T_d between the
presentation of one set of problem inputs and the production
of the outputs from that problem. This measure is obviously
unaffected by replication: two sorting chips have the same
delay as one. In defense of the delay measure, it can be argued
that a design's delay is a more fundamental limitation than its
data rate or period. As indicated above, the latter measures
can always be improved by replicating the design and splitting
its I/O stream. Also note that an upper bound result on T_d
implies an upper bound result on T_p. That is, a circuit's period
need be no larger than its delay, since idle time serves no useful
purpose in this paper's model of computation.
As Vuillemin [46] notes, any lower bound on circuit area
in terms of its period T_p or data rate D immediately implies
a similar bound in terms of its delay T_d. By this reasoning, T_p
is the stronger time measure for lower bounds, just as T_d is the
stronger time measure for upper bounds. Vuillemin's A T_p^2 =
Ω(N^2) lower bound on the complexity of sorting N numbers
thus implies an analogous result for the T_d metric. His proof
must also be adapted to this paper's metric of area performance
and to its less-restricted model of I/O behavior. However, one
can do better than an AT^2 = Ω(N^2) result. Vuillemin's
"transitivity" argument for sorting is somewhat weak in the
sense that it only measures the complexity of permuting the
least significant bit of each input word. By considering ε lg N
of the least significant bits, it is possible to show that AT^2 =
Ω(N^2 lg N) for the problem of sorting N words of (1 + ε) lg N
bits each [44].
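As a rough worked illustration of what this bound demands (constant factors are ignored, and the two operating points are chosen only for illustration), one can fix either the delay or the area and solve for the other quantity. The first line uses the O(lg N) delay quoted in the abstract for the largest design; the second assumes a hypothetical constant-area design.

```latex
\begin{align*}
AT^2 = \Omega(N^2 \lg N),\; T = O(\lg N)
  &\;\Longrightarrow\; A = \Omega\!\left(\frac{N^2 \lg N}{\lg^2 N}\right)
   = \Omega\!\left(\frac{N^2}{\lg N}\right), \\
AT^2 = \Omega(N^2 \lg N),\; A = O(1)
  &\;\Longrightarrow\; T = \Omega\!\left(N \sqrt{\lg N}\right).
\end{align*}
```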
There is reason to believe that an even stronger lower bound
is obtainable. An AT^2 = Ω(N^2 lg^2 N) result has been shown
for a slightly more restricted model of I/O, in which all the bits
of each input value must be read through the same I/O port
[42]. (In the present model, bit 0 of an input could be read on
the other side of the chip from bit i of that input.) It is difficult
to imagine how a circuit could take advantage of such a
"nonlocalized" I/O pattern and thus subvert the Ω(N^2 lg^2 N)
lower bound. Indeed, all of the circuits presented in this paper
access the bits of each input value in a localized fashion, and
thus none do better than AT^2 = Ω(N^2 lg^2 N).
This paper is organized in the following fashion: Section II
discusses the new VLSI model of computation, then defines
it precisely; Section III sketches thirteen different designs for
VLSI sorters and analyzes the area-time performance of each;
Section IV compares the performances of each of the designs,
with some discussion of the "constant factors" ignored by the
asymptotic model; and Section V concludes the paper with a
list of some of the open issues in VLSI complexity theory.
In an attempt to keep the paper to a reasonable length, the
constructions of Section III are described as briefly as possible.
Readers wishing to "fill in the details" will have to follow the
references, where applicable, and then exercise their own
ingenuity. This is a regrettable situation, but an inevitable one
since there is no accepted "high-level design language" for
VLSI.
II. MODEL OF VLSI COMPUTATION
In all theoretical models of VLSI, circuits are made of two
basic components: wires and gates. A gate is a localized set of
transistors, or other switching elements, which perform a
simple logical function. For example, a gate may be a "j-k
flip-flop" or a "three-input NAND." Wires serve to carry signals
from the output of one gate to the input of another.
Two parameters of a VLSI circuit are of vital importance:
its size and its speed. Since VLSI is essentially
two-dimensional, the size of a circuit is best expressed in terms of its area.
Sufficient area must be provided in a circuit layout for each
gate and each wire. Gates are not allowed to overlap each other
at all, and only two (or perhaps three) wires can pass over the
same point.
A convenient unit of area is the square of the minimum
separation between parallel wires. In the terminology of [26],
this paper's unit of area is equal to (4λ)^2, where λ is a constant
determined by the processing technology. Each unit of area
thus contains one, two, or three overlapping wires; or else it
contains a fraction of a gate. The actual size of this area unit
becomes smaller as technology improves. In 1978, it was
typically 150 µm^2 = 1.5 × 10^-6 cm^2; eventually, it may be as
small as 0.4 µm^2 [26, p. 35].
The speed of a synchronous VLSI circuit can be measured
by the number of clock pulses it takes to complete its
computation. Once again, the actual size of this time unit is a
technological variable. In 1978, a typical MOS clock period was
30-50 ns; and this may decrease to as little as 2-4 ns [26]. For
the superconducting technology of Josephson junctions, a clock
period of 1-3 ns is achievable today, using a process for which
the area unit is 25 µm^2 [17].
The speed of a VLSI circuit may be adversely affected by
the presence of very long wires, unless special measures are
taken. In many VLSI processes, a minimum-sized transistor
cannot send a signal from one end of the chip to the other in
one clock period. To accomplish such unit-delay cross-chip
communication, and to achieve large fanouts, special "driver"
circuits are employed. These drivers amplify the current of the
signal; O(lg k) stages of amplification are required to drive a
length-k wire [26, p. 14] or to drive k inputs. The use of
O(lg k)-stage drivers is reflected in the VLSI model's loading
rules, as formalized in Assumption 1f). Under these rules, each
stage of a driver has twice as many unit-area gates as the
preceding one. The gates are wired in parallel, so each stage
has twice the current sourcing (or sinking) capability of its
predecessor. Furthermore, the stages are individually clocked.
Thus, the driver behaves like an O(lg k)-bit shift register with
unit bandwidth and unit-power input requirements. Every
wire, even the longest one, has a throughput of one bit per time
unit; however, a length-k wire has T_d = O(lg k) delay. As
argued in [43], the area of the long-wire driver circuits can be
ignored (up to constant factors), as long as there is room to lay
out the wires they drive.
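The doubling-stage driver described above can be sketched numerically. The following is a minimal illustration, not a circuit design: the function names and the loading constants c_w = c_g = 0.5 are hypothetical values chosen for the example, and the only rule taken from the text is that a length-k wire feeding n inputs needs at least c_w k + c_g n driving gates, with each stage holding twice as many gates as the one before it.

```python
import math


def driver_stages(wire_length, fanout=1, c_w=0.5, c_g=0.5):
    """Count the doubling stages needed to drive a wire under the loading rule.

    Stage i (counting from 0) holds 2**i unit-area gates wired in parallel.
    The last stage must hold at least c_w * wire_length + c_g * fanout gates.
    The constants c_w and c_g are illustrative assumptions, not measured values.
    """
    required = math.ceil(c_w * wire_length + c_g * fanout)
    stages = 1
    while 2 ** (stages - 1) < required:
        stages += 1
    return stages


def driver_gate_count(stages):
    """Total gates in the driver: 1 + 2 + 4 + ... + 2**(stages - 1)."""
    return 2 ** stages - 1


if __name__ == "__main__":
    for k in (1, 16, 1024):
        s = driver_stages(k)
        # Each stage is individually clocked, so the delay equals the stage count.
        print(k, s, driver_gate_count(s))
```

Because each stage is clocked separately, the stage count doubles as the wire's delay in time units, and it grows with lg k while the gate count grows only linearly in k.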
Assumption 1f) (the "logarithmic delay assumption") is used
here because it leads to realistic circuit designs and time
bounds, as well as to an interesting theoretical question. As it
turns out, the time bounds obtained for VLSI sorting under
this assumption are rarely different from the ones that would
be obtained under a "unit-delay" assumption (in which each
gate is able to transmit its output all the way across the circuit,
in one clock period). The only exceptions might be the highly
parallelized versions of the long-wire networks discussed in
Sections III-J and III-K. Driver delays will exceed single-bit
comparison delays in these networks unless one is able to lay
them out with the smallest possible maximum wire-lengths.
This is an attractive open problem, at least for the N-vertex
shuffle-exchange network of Section III-J: can it be embedded
with a maximum wire length of only O(N/lg^2 N)?
From a practical standpoint, it may be argued that the
logarithmic delay assumption is either too severe or too lenient,
depending on the technology. The former is currently the case
in the I^2L and Josephson junction processes [11], [17]. As of
now, both are really unit-delay technologies. Minimum-sized
gates can drive wires the entire length of a chip without
significantly increasing logic delay. However, the results of this
paper still apply if the drivers are omitted from the circuit
constructions of Section III.
It seems unlikely that the logarithmic delay assumption will
ever be too lenient on synchronous MOS circuits. Seitz [38]
projects a signal transmission velocity of (1 cm)/(3 ns) in a
fully developed MOS technology. This means that a cross-chip
communication will only take a few clock periods, even if the
"chip" is as large as a present-day "wafer." In other words, the
time performance of the fully developed MOS technology is
only slightly overestimated by the logarithmic delay
assumption. The true delay would best be modeled as logarithmic plus
a small constant. Modeling delay as a linear function of
distance, as suggested by Chazelle and Monier [7], would greatly
exaggerate the importance of delay in the determination of the
speed of such circuits.
If circuits ever become much faster or much larger than
envisioned today, the logarithmic delay assumption should be
modified. As a case in point, consider the Josephson junction
circuit assemblies currently built by IBM. They are 10 cm on
a side, and they run on a 1-3 ns clock [17]. The wires in these
circuits are superconductors, but, of course, they cannot
transmit information at a velocity greater than (a fraction of)
the speed of light. Right now, the clock frequency and circuit
dimensions are just small enough to allow a signal to propagate
from one side of the circuit to the other in one clock period. Any
increase in either speed or size would make this impossible. The
computational limitations of such enhanced (and hypothetical)
technologies could be analyzed under Chazelle and Monier's
linear delay assumption.
Thus far in the discussion, only "standard features" have
been introduced to the VLSI model. The interested reader is
referred to [42] for more details on the practical significance
of the model, and to [37] for an excellent introduction to the
theoretical aspects of VLSI modeling.
As noted in the introduction, a major distinction between
the model of this paper and most previous VLSI models is the
way in which it treats "I/O memory." Here, only a nominal
area charge is made for the memory used to store problem
inputs and outputs, even if this memory is also used for the
storage of intermediate results.
In the new model, each input and output bit is assigned a
place in a k-bit "I/O memory" attached to one or more "I/O
ports." Two types of access to the I/O memory are
distinguished. If the bits are accessed in a fixed order, the I/O
memory is organized as a shift register and accessed in O(1)
time per bit. If the access pattern is more complex, a random
access memory (RAM) is used. Such a memory has an access
delay of O(lg k) [26, p. 321]. The random access time covers
both the internal delays of the memory circuit and the
time it takes the I/O port to transmit (serially) O(lg k) address
bits to the RAM.
This paper's serial I/O interfaces may seem a bit artificial,
since typical RAM's are accessed with word-parallel address
and data lines. More careful consideration reveals that any
such word-parallel RAM interface could be simulated with
several serial interfaces at no cost in asymptotic area and
time.
Allowing more than one I/O port to connect to a single I/O
memory makes it easy to model the use of multiport memory
chips. Their usage must be restricted, however, to remove the
(theoretical) temptation to use multiport memories and
printed-circuit board wiring as a means of avoiding on-chip
wiring. (Note that a two-port memory provides a
communication channel between its two I/O ports, eliminating any need
for an on-chip wire between them.) The restriction is that all
I/O ports connecting to a single memory must be physically
adjacent to each other in the chip layout. This avoids any
possibility of "rat's-nest" wiring to the memory chips, making
the model essentially unilocal rather than multilocal [36] in
its I/O assumptions.
The model makes as few assumptions as possible about the
actual location of the I/O memory circuitry, even though this
can have a large effect on system timing. If the memory is
placed on a different chip from the processing circuitry, its
access time is considerably increased. Fortunately, this will
not always invalidate the model's timing assumptions. The
O(lg k) delay of a k-bit RAM will dominate the delay of an
off-chip driver, if k is large enough. Alternatively, if k is small,
it should be relatively easy to locate the RAM on the processor
chip. As for off-chip "shift register" I/O memories, there
should be no particular difficulty in implementing these in such
a way that one input or output event can happen every O(1)
time units.
As indicated above, time charges for off-chip I/O are
problematical and may be underestimated in the current
model. Area charges for I/O are also troublesome. Here, I/O
ports are assumed to have O(1) area even though they are
obviously much larger than a unit-area wire crossing or an
O(1) area gate. It is also assumed that a design can have an
unlimited number of I/O ports. In reality, chips are limited
to one or two hundred pins, and each pin should be considered
a major expense (in terms of manufacturing, reliability, and
circuit board wiring costs). An attempt is made in Section IV
to use more realistic estimates of I/O costs when evaluating
the constructions in Section III.
The complete model of VLSI computation is summarized
below.
Assumption 1-Embedding:
a) A VLSI circuit consists of wires and nodes embedded
in the Euclidean plane. In graph-theoretic terms, wires are
hyperedges joining two or more nodes. These hyperedges are
embedded as a tree of (two-ended, straight-line) wire segments
and arbitrarily positioned "fan-out points."
b) Wire segments have unit width; the length of a wire is
the total length of all its wire segments.
c) At most two wires may cross over each other at any point
in the plane.
d) A node occupies O(1) area. Thus, a node has at most
O(1) input wires and O(1) output wires. (In general, a node
can implement any Boolean function that is computable by a
constant number of TTL gates. Hence, an "and gate" or a "j-k
flip-flop" is represented by a single node: see Assumption 4.)
e) Wires may not cross over nodes, nor can nodes cross over
nodes.
f) Unboundedly long wires and large fan-outs are
permissible under the following loading rule. A length-k wire may
serve as the input wire for n nodes only if it is the output wire
for at least c_w k + c_g n nodes. (In electrical terms, the output
nodes work in parallel as current sources or sinks for the
capacitive load represented by a wire. The "loading constants"
c_w and c_g are always less than one, since otherwise it would be
impossible to connect two nodes together, but their precise
values are technology-dependent.)
Assumption 2-Problem Definition:
a) A chip has degree of concurrency c if it solves c problem
instances simultaneously.
b) Each of the N input variables in a problem instance takes
on one of M different values with equal likelihood.
c) M = N^(1+ε), for some fixed ε > 0. Furthermore, a nearly
nonredundant code is used, so that each input and output value
is represented as a word with ⌈(1 + ε)(lg N)⌉ = O(lg N) bits.
(This assumption makes it possible to express area and time
bounds in terms of N alone. It also seems to be required in the
proof of a strong lower bound on the area * time^2 complexity
of the sorting problem [44].)
d) The output values of a problem instance are a
permutation of its input values into increasing order.
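The word length implied by Assumption 2c can be computed directly. This is a small sketch under the assumption that a nearly nonredundant code rounds (1 + ε) lg N up to a whole number of bits; the function name and the sample values of N and ε are hypothetical.

```python
import math


def word_bits(n, eps):
    """Bits per value when M = n**(1 + eps) distinct values are coded
    with (almost) no redundancy; rounding up is an assumption."""
    return math.ceil((1 + eps) * math.log2(n))


if __name__ == "__main__":
    # Sample problem sizes, chosen only for illustration.
    print(word_bits(1024, 0.5))
    print(word_bits(1024, 1.0))
```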
Assumption 3-Timing:
a) Wires have unit bandwidth. They carry a one-bit
"signal" in each unit of time.
b) Nodes have O(1) delay. (This assumption, while
realistic, is theoretically redundant in view of Assumption 3a).)
Assumption 4-Transmission Functions:
a) A transmission function is associated with each node,
defining how its outputs and internal state react to the signals
on its input wires. More precisely, the "state" of a node is a
bit-vector that is updated every time unit according to some
fixed function of the signals on its input wires. Similarly, the
signals appearing on the output wires of a node are some fixed
function of its current state. (With this definition, a node is
seen to have the functionality of a finite state automaton of the
Moore variety.)
b) If outputs from two or more nodes are connected to the
same wire, their output signals must never disagree. To ensure
this, the nodes must have identical transmission functions and
be connected to the same input wires. (A weaker version of this
assumption allows "or-tying" of output wires [44].)
c) Nodes fall into three classes: logic nodes, I/O ports, and
I/O memories. I/O memories are further classified as either
"RAM-type" or "shift-register-type" memories.
d) I/O memories may not be connected directly to logic
nodes.
e) Logic nodes and I/O ports are limited to O(1) bits of
state.
f) A (k_1 × k_2)-bit I/O memory contains k_1 k_2 + lg(k_1 k_2)
bits of state. An "address register" is formed of lg(k_1 k_2) bits of
state. The other k_1 k_2 bits in the state vector are "data."
g) Each problem input bit is assigned to a fixed (i.e.,
problem-independent) position in some I/O memory's data
area. Problem inputs are initially available only at these
positions. At the beginning of a computation, all other state bits
are initialized to fixed, problem-independent values.
h) Each problem output bit is assigned to a fixed position
in an I/O memory. At the completion of a computation, the
memory data corresponding to the output bits must have the
values defined in Assumption 2. (Note that Assumptions 4a),
4b), 4g), and 4h) make the model deterministic,
where-oblivious [25], and unilocal [36].)
i) I/O ports connected to RAM-type (k_1 × k_2)-bit
memories can run a memory cycle every O(k_2 + lg k_1) time
units. (These memory cycles are allocated on a first-come,
first-served basis among the O(1) ports connected to a single
memory.) During the first lg k_1 time units of a cycle, the port
receives a bit-serial "word address" on an input wire. The next
input signal is interpreted as a read/write indicator. If a write
cycle is indicated, the following k_2 input signals are written
into the addressed word in the memory. Alternatively, during the
last k_2 time units of a read cycle, the value of the addressed
word appears in bit-serial format on the I/O port's output
wire.
j) I/O ports connected to shift-register-type (k_1 × k_2)-bit
memories can run a memory cycle every O(k_2) time units. As
in the case of a RAM-type memory, cycle requests are served
on a FIFO basis. During the first (third, fifth, etc.) time unit
of a cycle, the value of the currently addressed data bit is
available on the port's output wire. During even-numbered
time units of a cycle, the signal appearing on the port's input
wire is written into the addressed data bit, and the memory's
address register is incremented mod k_1 k_2.
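The two memory-cycle rules in Assumptions 4i) and 4j) can be made concrete with a small simulation. This is only a sketch under stated assumptions: the names `ram_cycle_time` and `ShiftRegisterMemory` are hypothetical, the RAM cycle is taken to last exactly lg k_1 address bits plus one read/write bit plus k_2 data bits (the text only gives the O(k_2 + lg k_1) order), and one shift-register cycle is assumed to handle k_2 bits at two time units per bit.

```python
import math


def ram_cycle_time(k1, k2):
    """Time units for one RAM-type cycle: address bits, one read/write
    indicator, then k2 data bits (an assumed exact count)."""
    return math.ceil(math.log2(k1)) + 1 + k2


class ShiftRegisterMemory:
    """Toy model of a shift-register-type (k1 x k2)-bit I/O memory."""

    def __init__(self, k1, k2, data):
        assert len(data) == k1 * k2
        self.k2 = k2
        self.data = list(data)
        self.address = 0  # address register, counts mod k1 * k2

    def cycle(self, write_bits):
        """Run one memory cycle of 2 * k2 time units and return the bits read."""
        assert len(write_bits) == self.k2
        read_bits = []
        for bit in write_bits:
            # Odd-numbered time unit: the addressed bit appears on the output wire.
            read_bits.append(self.data[self.address])
            # Even-numbered time unit: the input is written, then the address advances.
            self.data[self.address] = bit
            self.address = (self.address + 1) % len(self.data)
        return read_bits


if __name__ == "__main__":
    mem = ShiftRegisterMemory(2, 3, [1, 0, 1, 1, 1, 0])
    print(mem.cycle([0, 0, 0]), mem.data, mem.address)
    print(ram_cycle_time(1024, 16))
```

The sketch ignores the first-come, first-served arbitration among ports, since it models a single port only.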
Assumption 5-Area, Time Performance:
a) The total processing area A_processing of a chip is the
number of unit squares in the smallest enclosing rectangle.
b) The area performance A of a chip is its total area divided
by its degree of concurrency c.
c) The total time T_total is the average number of time units
that elapse between the beginning and end of a computation
on c problem instances.
d) The time performance T_d of a chip is the average (over
all problem instances) of the number of time units between the
first and last memory cycles accessing the data in a particular
problem instance.
e) The period T_p of a chip is equal to T_total divided by c.
Note that T_p ≤ T_d ≤ T_total, since a chip with T_p > T_d must
be "wasting time" between successive problem instances, and
thus could be redesigned to have T_p = T_d. Also note that if c
= 1, then T_p = T_d = T_total.
f) The I/O bandwidth of a sorting chip is the (average) total
number of bits read or written into its I/O memories, divided
by T_total.
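The measures of Assumption 5 can be tabulated for a design and for replicated copies of it. The sketch below is illustrative only: the `Design` class, the `replicate` helper, and the numeric values are hypothetical, and replication is assumed to multiply area and concurrency while leaving T_total and T_d unchanged.

```python
from dataclasses import dataclass


@dataclass
class Design:
    processing_area: float  # A_processing, in unit squares
    concurrency: int        # c, problem instances solved simultaneously
    total_time: float       # T_total, in time units
    delay: float            # T_d, in time units

    @property
    def area_performance(self):
        """A = A_processing / c, the area per problem."""
        return self.processing_area / self.concurrency

    @property
    def period(self):
        """T_p = T_total / c."""
        return self.total_time / self.concurrency

    def area_time_squared(self):
        """The figure of merit A * T_d**2."""
        return self.area_performance * self.delay ** 2


def replicate(design, copies):
    """Treat several identical chips as one design (assumed to run in lockstep)."""
    return Design(design.processing_area * copies,
                  design.concurrency * copies,
                  design.total_time,
                  design.delay)


if __name__ == "__main__":
    one = Design(processing_area=100.0, concurrency=1, total_time=20.0, delay=20.0)
    two = replicate(one, 2)
    print(one.area_performance, two.area_performance)  # unchanged by replication
    print(one.period, two.period)                      # the period halves
    print(one.area_time_squared(), two.area_time_squared())
```

Under these definitions replication improves only the period, which is why the delay T_d is the time measure paired with the area performance A.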
III. CIRCUIT CONSTRUCTIONS
This section presents thirteen constructions forsorting chips.
Each will be briefly described in its own subsection. First,
however, we present a few useful building blocks.
A serial comparison-exchange module (a "comparator")
can be built of 0(1) gates [27] in A = 0(1) area. It has two
bit-serial data inputs, A and B, and two bit-serial data outputs,
max(A, B) and min(A, B). These inputs and outputs are
serialized in a binary code, most-significant bit first.
In some applications, two control lines are added to the
comparison-exchange module. The four control states are 1)
unconditionally "pass-through" the two inputs; 2) uncondi-
tionally "swap" the two inputs; 3) send the larger of its two
inputs to output 1; 4) send the smaller of its two inputs to
output 1. These more complex modules can still fit in O(1) area
and produce two output bits every O(1) time units.
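The behavior of such a module can be sketched in Python. This is a behavioral model under stated assumptions, not a gate-level design: the function name `serial_compare_exchange`, the mode names, and the helper `to_bits` are invented for illustration. It processes one bit pair per step, most-significant bit first, and keeps only two bits of state, which is consistent with the O(1) size claimed above.

```python
def to_bits(value, width):
    """MSB-first list of bits for a non-negative integer (helper for the demo)."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def serial_compare_exchange(a_bits, b_bits, mode="max_first"):
    """Bit-serial compare-exchange on two equal-length MSB-first bit lists.

    mode is one of:
      "pass"      - control state 1: pass both inputs through
      "swap"      - control state 2: unconditionally swap
      "max_first" - control state 3: larger input goes to output 1
      "min_first" - control state 4: smaller input goes to output 1
    Returns (output1, output2).
    """
    assert len(a_bits) == len(b_bits)
    decided = False  # state bit: has a differing bit pair been seen yet?
    swap = False     # state bit: route input B to output 1?
    out1, out2 = [], []
    for a, b in zip(a_bits, b_bits):
        if mode == "pass":
            swap = False
        elif mode == "swap":
            swap = True
        elif not decided and a != b:
            # The first differing bit (scanning from the MSB) fixes the order.
            decided = True
            a_is_larger = a > b
            swap = (not a_is_larger) if mode == "max_first" else a_is_larger
        first, second = (b, a) if swap else (a, b)
        out1.append(first)
        out2.append(second)
    return out1, out2


hi, lo = serial_compare_exchange(to_bits(5, 3), to_bits(6, 3), "max_first")
print(hi, lo)  # [1, 1, 0] [1, 0, 1]
```

Until the first differing bit arrives, both inputs carry identical bits, so the routing decision can be deferred without buffering.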
Comparison-exchange modules may be pipelined, as illustrated
in Fig. 1 for the case of seven-bit words. Pairs of input
values enter the module from the top, and move downwards
through the array at the rate of one row per time unit. In each
row, the circular element performs a comparison-exchange
operation on one bit of the inputs; the square elements pass
their inputs through unchanged. Information about the "di-
rection" of the comparison-exchange for each pair of input
values travels diagonally through the array, from one circle
to the next.
A pipelined comparison-exchange module for O(lg N)-bit
words can do a complete comparison-exchange operation every
Tp = O(1) time units with a delay of Td = O(lg N) time units.
The total area of the comparator as drawn is O(lg^2 N), and its
concurrency is lg N, giving it an area performance of A = O(lg
N). In most applications the square boxes can be deleted, since
the input and output data are already "staggered." This reduces
Aprocessing to O(lg N), giving the pipelined comparator
an area performance of A = O(1). Note that this is identical
to the area performance of the nonpipelined comparator.
Fig. 1. A pipelined comparison-exchange module. [Figure labels: data inputs and data outputs (MSB to LSB), control inputs, MSB compare-exchange, LSB compare-exchange.]
A third building block is the programmed control unit
(PCU). A PCU is used to generate a large number of control
signals from a very small area. In the constructions below,
entire sorting algorithms are encoded into O(1) PCU instructions.
Each instruction is O(lg N) bits long, and executes
in Td = O(lg N) time units. The instruction set includes
branches, arithmetic operations (shifts, adds, and negations),
tests, and register-register moves. A PCU has O(1) different
registers. One of these registers is connected to the control lines
of a comparison-exchange module. Another register is used
to generate address and control signals for any I/O ports in the
vicinity.
In the constructions below, the term "bit-serial processor"
is used to denote the combination of a PCU, O(1) I/O ports,
and a bit-serial comparison-exchange module. Each processor
can fit into an O(1)-by-O(lg N) unit rectangle, and can perform
one comparison-exchange operation every Tp = Td =
O(lg N) time units.
"Word-parallel processors" are used to augment the performance
of some of the designs. A word-parallel processor is
constructed from a PCU, a pipelined comparison-exchange
module, and O(lg N) I/O ports connected to shift-register
memories. (There seems to be no reason to use a parallel processor
with a random-access memory, since the delay of a serial
processor matches the delay of an O(N)-word RAM.)
A word-parallel processor can perform one comparison-
exchange operation every Tp = O(1) time units, for its inputs
are easily "staggered" in the manner required by its pipelined
comparison-exchange module. Finally, a word-parallel processor
fits into an O(1)-by-O(lg N) unit rectangle. It thus
occupies the same area as does a serial processor, to within
constant factors.
Now we are ready to examine sorting circuits for VLSI. The
designs are presented in order of increasing parallelism.
A. Uniprocessor Heapsort
This is the smallest sorter imaginable. It has one bit-serial
processor running a standard heapsort algorithm [19, pp.
145-149] on N words of data. Each comparison-exchange and
each "random" access to the input data takes O(lg N) time,
so a complete heapsort takes Td = Tp = O(N lg^2 N) units of
time. The area performance of this design is A = O(lg N).
Other fast sorting algorithms, such as mergesort or quicksort,
could be used in a uniprocessor design. However, none
would yield a better AT^2 performance, since all require O(N
lg N) random accesses to the processor's I/O memory.
B. (lg N)-Processor Heapsort
Heapsort can be parallelized on a linear array of lg N bit-
serial processors, one for each level of the heap [2], [40] (see
Fig. 2). The heap operations are pipelined; during an insertion
(or deletion) a data element moves down (or up) the heap by
one level every O(lg N) time units. The processor at the top of
the heap stores one data element, the smallest value that has
been encountered. The kth processor (0 ≤ k < lg N) handles
2^k elements, storing them in a (2^k × lg N)-bit RAM. Total
sorting time is Td = Tp = O(N lg N), and the area is A =
O(lg^2 N).
Fig. 2. The (lg N)-processor heapsort for N = 16.
C. (1 + lg N)-Processor Mergesort
The mergesort algorithm, like the heapsort, fits quite nicely
on about lg N processors [45]. Two variable-length FIFO
queues are associated with each processor; processor Pk (0 ≤
k < lg N) has two 2^k-word queues attached to its output
lines.
Referring to Fig. 3, processor Pk (k > 0) merges sorted lists
of length 2^(k-1) into sorted lists of length 2^k. It does this by
placing the smaller of the elements at the head of its two input
queues onto the tail of one of its output queues. Once an entire
output list of 2^k elements is complete, the processor starts
filling its other output queue. This process repeats as long as
inputs are presented to the chip.
Processor P0 is a special case. It merely "splits" its input
stream into two, placing alternate elements onto its left-hand
and right-hand output queues. These elements should be
considered sorted lists of length 1, since they are "merged" into
sorted lists of length 2 by processor P1.
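The data flow just described can be sketched in Python. This is a functional model only: it runs the stages one after another instead of concurrently, ignores all timing, and assumes N is a power of two. The function names `split_stage`, `merge_stage`, and `pipeline_mergesort` are invented for illustration.

```python
from collections import deque


def split_stage(stream):
    """P0: deal the input alternately onto the left and right output queues."""
    left, right = deque(), deque()
    for index, value in enumerate(stream):
        (left if index % 2 == 0 else right).append(value)
    return left, right


def merge_stage(left_in, right_in, k):
    """Pk (k > 0): merge sorted runs of length 2^(k-1) into runs of length 2^k,
    writing finished runs alternately to the two output queues."""
    half = 1 << (k - 1)
    outputs = (deque(), deque())
    run_index = 0
    while left_in or right_in:
        out = outputs[run_index % 2]
        taken_left = taken_right = 0
        while taken_left < half or taken_right < half:
            # Take from the left queue if the right run is exhausted, or if
            # both runs have elements and the left head is not larger.
            take_left = taken_right == half or (
                taken_left < half and left_in[0] <= right_in[0]
            )
            if take_left:
                out.append(left_in.popleft())
                taken_left += 1
            else:
                out.append(right_in.popleft())
                taken_right += 1
        run_index += 1
    return outputs


def pipeline_mergesort(values):
    n = len(values)
    lg_n = n.bit_length() - 1
    assert n >= 2 and n == 1 << lg_n, "this sketch assumes N is a power of two"
    left, right = split_stage(values)
    for k in range(1, lg_n + 1):
        left, right = merge_stage(left, right, k)
    return list(left)  # the last processor emits a single run of length N


print(pipeline_mergesort([5, 3, 7, 1, 6, 0, 2, 4]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because each stage alternates its finished runs between two queues, the next stage always finds the two runs it must merge at the heads of different queues.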
To achieve maximal performance, it is tempting to use
pipelined, word-parallel processors. Unfortunately, it seems
to be impossible to use these efficiently. There is no way to
decide which data elements should next be entered into the
pipeline until the previous comparison is complete and the
appropriate queue is popped. Thus bit-serial processors are
used in this paper's mergesort design.
The FIFO queues between processors are most easily built
of RAM memory. A 2^k-word RAM's O(k) access time is fast
enough to keep up with a bit-serial processor working on O(lg
N)-bit words, since k ≤ lg N. (Alternatively, as pointed out
by an ingenious referee, each variable-length 2^k-word FIFO
could be implemented with two 2^k-word shift registers. Processor
P_{i+1} can empty one of the shift registers while Pi fills
up the other one. Even though it requires twice as much
memory, the referee's idea is advantageous if shift register
memory cells are much smaller or cheaper than RAM memory
cells. The idea does, however, have a constant factor disadvantage
in time since it introduces an additional O(2^i) delay
between Pi's output stream and P_{i+1}'s input stream.)
The time performance of the mergesorter is limited by the
data rate of its individual processors. It takes Td = O(N lg N)
time units for all the input elements to clear the first processor,
and another O(N lg N) time units for the elements to percolate
through the O(N) words of internal FIFO storage. Total time
for a sort is thus Td = O(N lg N). The area of the design is A
= O(lg^2 N), since each of the 1 + lg N processors fits into an
O(lg N) area rectangle.
Fig. 3. The (1 + lg N)-processor mergesort for N = 8.
D. (lg N)-Processor Bitonic Sort
Superficially, this design is very similar to the previous two
in that it uses O(lg N) processors with geometrically varying
memory sizes. In this case, processor Pk (0 ≤ k < lg N) has an
auxiliary (N/2^(k+1))-word fixed-length FIFO queue, as illustrated
in Fig. 4. (If the feedback path from P_{lg N - 1} to P0 were
deleted, Fig. 4 would be identical to the "cascade" design for
pipelined FFT computation [9].)
The processors execute a bitonic sorting algorithm [19, p.
237]. For the purposes of this paper, the bitonic sort algorithm
can be described as lg N "global iterations." Each global iteration
consists of lg N "distance-(N/2^(k+1)) operations" on
the N input values, in the following order: a distance-N/2
operation, a distance-N/4 operation, . . . , a distance-2 operation,
and finally a distance-1 operation. As indicated by the
repeated use of the index k, processor Pk is responsible for
performing all distance-(N/2^(k+1)) operations. With this in
mind, a global iteration is seen to be one complete pass of the
data around the ring of processors in Fig. 4.
The "distances" in the "operations" refer to the natural
indexing of data values within the bitonic sorter. Initially, input
xi enters processor P0 just before input x_{i+1}. Using its N/2-
word FIFO queue, P0 is able to compare input x_{N/2+i} with
input xi, and to exchange these values (if necessary) before
passing a naturally indexed sequence (x'0, x'1, ...) to processor
P1. In other words, processor Pk does compare-exchange operations
on data values whose indexes differ only in the kth
most significant bit. These index pairs are of the form (i,
N/2^(k+1) + i).
Fig. 4. The (lg N)-processor bitonic sorter for N = 16.
Finally, the "direction" of each processor's comparison-
exchange operations is not fixed during the course of the bitonic
sort algorithm. Sometimes xi and x_{N/2^(k+1)+i} should be
interchanged if xi > x_{N/2^(k+1)+i}, sometimes they should be
interchanged if xi < x_{N/2^(k+1)+i}, and sometimes they need not be
interchanged under any circumstances (this last is a "no-op"
distance-(N/2^(k+1)) operation). These three possibilities are
reflected in three patterns of data flow through individual
processors. When it uses pattern number 1, processor Pk performs
a "no-op" by placing the elements it receives from processor
P_{k-1} onto the back of its FIFO queue, and sending the
elements that come off the front of its queue to processor P_{k+1}.
In pattern number 2, processor Pk does a comparison-exchange
on the element at the front of its queue and the element it receives
from processor P_{k-1}, sending the smaller of the two to
processor P_{k+1} and placing the larger on the back of its queue.
Pattern number 3 is the same as pattern number 2, except that
the larger of the two elements is sent to the next processor and
the smaller is placed on the back of the queue. The complete
bitonic sort algorithm is summarized in Fig. 5.
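The three data-flow patterns can be made concrete with a small Python model of one processor and its FIFO queue. This is a sketch under simplifying assumptions: it handles one pass of the data, models no timing, and drains the queue at the end of the pass instead of overlapping with the next pass. The name `processor_pass` and its parameters are invented for illustration.

```python
from collections import deque


def processor_pass(stream, d, block_patterns):
    """One pass of the data through a processor with a d-word FIFO queue.

    block_patterns[j] is 1, 2, or 3 and selects the pattern used on the
    second half of the j-th block of 2*d consecutive elements.
    """
    queue, out = deque(), []
    incoming = iter(stream)
    for pattern in block_patterns:
        for _ in range(d):               # first half of a block: fill the queue
            queue.append(next(incoming))
            if len(queue) > d:           # pattern 1: the front element moves on
                out.append(queue.popleft())
        for _ in range(d):               # second half of the block
            new, front = next(incoming), queue.popleft()
            if pattern == 1:
                send, keep = front, new  # no-op: the queue acts as a delay line
            elif pattern == 2:
                send, keep = min(front, new), max(front, new)
            else:
                send, keep = max(front, new), min(front, new)
            out.append(send)
            queue.append(keep)
    out.extend(queue)                    # drain the elements still queued
    return out


print(processor_pass([3, 1, 0, 2], 2, [2]))  # [0, 1, 3, 2]
```

In the example, the distance-2 pairs (3, 0) and (1, 2) are each put in ascending order, while the output stream keeps its natural indexing.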
Interpreting the algorithm of Fig. 5, processor Pk executes
pattern number 1 on the first (lg N - k - 1)N elements it
encounters. This corresponds to (lg N - k - 1) "global" iterations
of the outermost loop in Fig. 5. Processor Pk becomes
active in its reordering of the data stream only during the last
k + 1 global iterations. It alternately fills its queue with new
elements (executing N/2^(k+1) instances of pattern number 1)
and performs comparison-exchanges of the queue data with
the incoming data (executing N/2^(k+1) instances of pattern
number 2 or 3).
The conditional expressions of statements 7 and 8 may be
implemented with three counters (for g, i, and j) in each processor.
The DIV and MOD operations merely select one bit
of these counters to control the pattern of data flow. Thus, the
bitonic sort can be performed on lg N bit-serial processors of
O(lg N) area, for a total area of A = O(lg^2 N). It takes Tp =
O(lg N) time for a data element to pass through a bit-serial
processor, so that each "global iteration" takes O(N lg N)
time. Total time for a bitonic sort on bit-serial processors is
thus Td = O(N lg^2 N).
1.  FOR g ← 0 TO lg N - 1 DO
2.    FOR j ← 0 TO 2^k - 1 DO
3.      FOR i ← 0 TO N/2^(k+1) - 1 DO  /* Fill up FIFO with new data */
4.        (execute pattern 1);
5.      OD;
6.      FOR i ← 0 TO N/2^(k+1) - 1 DO  /* Perform comparison-exchanges as necessary */
7.        IF g < lg N - k - 1 THEN (execute pattern 1)
8.        ELSEIF ((j DIV (2^(1+k+g)/N)) MOD 2) = 0 THEN (execute pattern 2)
9.        ELSE (execute pattern 3) FI;
10.     OD;
11.   OD;
12. OD.
Fig. 5. The bitonic sort algorithm executed by processor Pk of Fig. 4.
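The control logic of Fig. 5 can be checked with an index-level simulation in Python. This is a sketch, not a model of the hardware: it applies each processor's compare-exchanges directly to an array instead of streaming data through FIFO queues, and it assumes N is a power of two. The test on g mirrors statement 7, and the expression for `period` mirrors the divisor in statement 8; the function name `ring_bitonic_sort` is invented for illustration.

```python
def ring_bitonic_sort(values):
    """Index-level simulation of lg N global iterations around the ring."""
    n = len(values)
    lg_n = n.bit_length() - 1
    assert n == 1 << lg_n, "this sketch assumes N is a power of two"
    x = list(values)
    for g in range(lg_n):                # global iteration (statement 1)
        for k in range(lg_n):            # data visits P0, P1, ... in ring order
            if g < lg_n - k - 1:
                continue                 # statement 7: Pk only executes no-ops
            d = n >> (k + 1)             # distance handled by Pk
            period = (1 << (1 + k + g)) // n
            for j in range(1 << k):      # blocks of 2*d elements (statement 2)
                # statement 8: pattern 2 (ascending) or pattern 3 (descending)
                ascending = (j // period) % 2 == 0
                base = j * 2 * d
                for i in range(d):
                    lo, hi = base + i, base + i + d
                    if (x[lo] > x[hi]) == ascending:
                        x[lo], x[hi] = x[hi], x[lo]
    return x


print(ring_bitonic_sort([5, 3, 7, 1, 6, 0, 2, 4]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

In the last global iteration the divisor equals 2^k, so every block uses pattern 2 and the final output is in ascending order.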
The area * time^2 performance of the design may be improved
by using word-parallel processors. Now each global
iteration requires only O(N) time, if O(lg N) communication
lines are provided between processors. Total time is Td = O(N
lg N); total area is still A = O(lg^2 N). Note that this parallelized
design requires O(lg^2 N) I/O ports, in order to provide
sufficient memory bandwidth to the FIFO queues. Also, portions
of the control algorithm will have to be hard-wired, so
that the three counters described above can be incremented
in O(1) time.
It is interesting that the lg N processor heapsorter has exactly
the same area and time performance as the lg N processor
bitonic sorter, even though the heapsorter does not use parallelized
comparators. The heapsort algorithm requires each of
the lg N processors to make "random accesses" to their local
memory. The extra time taken by these slower accesses is exactly
balanced by the greater number of comparison-exchange
operations required by the bitonic sorting algorithm.
(Chung, Luccio, and Wong have also proposed a lg N-processor
bitonic sort for a magnetic bubble memory system [8].
Their algorithm has an inferior time performance to the one
described above, since only one of their processors is active at
any time.)
E. O(lg^2 N)-Processor Bitonic Sort
This design "unrolls" the lg N-processor bitonic sort, so that
each processor is responsible for only one distance-N/2^(k+1)
operation. Since about half of the lg^2 N operations of the bitonic
sort algorithm are no-ops, only about (1/2) lg^2 N processors
are required in this version of the algorithm; see Fig. 6. Each
processor fits in an O(1)-by-O(lg N) unit rectangle, so the
entire design occupies O(lg^3 N) area.
A surprisingly large amount of time and FIFO storage area
is saved by eliminating the no-ops when "unrolling" the bitonic
sort on lg N processors. Since a distance-N/2^(k+1) operation
is implemented with N/2^(k+1) words of FIFO storage, and since
all but k + 1 of the distance-(N/2^(k+1)) operations are no-ops,
the total storage is Σ_k (k + 1)(N/2^(k+1)), or a little less than 2N
words. The problem solution time is proportional to the length
of this pipeline, or Td = O(N) if word-parallel processors are
used. The area performance is half of its total area, A = O(lg^3
N), because the pipeline stores two problems at a time.
Fig. 6. The (1/2)(lg N)(1 + lg N)-processor bitonic sorter for N = 8.
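The storage total quoted above can be checked numerically. The Python sketch below sums (k + 1)(N/2^(k+1)) over 0 ≤ k < lg N and compares the result with the closed form 2N - lg N - 2. The closed form is derived here from the standard identity for the sum of m/2^m and is not stated in the original text. The function name `unrolled_fifo_words` is invented for illustration.

```python
def unrolled_fifo_words(n):
    """Total FIFO words when every non-no-op operation gets its own processor."""
    lg_n = n.bit_length() - 1
    assert n == 1 << lg_n, "this sketch assumes N is a power of two"
    # k + 1 active distance-N/2^(k+1) operations, each with N/2^(k+1) words
    return sum((k + 1) * (n >> (k + 1)) for k in range(lg_n))


for n in (8, 64, 1024):
    lg_n = n.bit_length() - 1
    print(n, unrolled_fifo_words(n), 2 * n - lg_n - 2)
```

For N = 8 the sum is 4 + 4 + 3 = 11 words, spread over the 1 + 2 + 3 = 6 processors of Fig. 6.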
The AT^2 performance of this design is a factor of lg N better
than that of the previous design. To understand this phenomenon,
it is helpful to compare the performance of one O(lg^2
N)-processor bitonic sorter with that of a collection of lg N
identical (lg N)-processor bitonic sorters. Both have the same
amount of total area, and both solve lg N sorting problems in
Ttotal = O(N lg N) time (if word-parallel processors are used).
Due to the elimination of the no-ops, however, the O(lg^2
N)-processor implementation solves each sorting problem with
logarithmically less delay.
F. √(N lg N)-Processor Bitonic Sort
Chung, Luccio, and Wong have recently proposed implementing
a bitonic sort on √(N lg N) processors in a linear array
[8]. Here, each processor has √(N/lg N) words of shift register
storage. It can run a serial bubble sort algorithm on its local
store in Td = O(N/lg N) time, if it uses word-parallel processors.
Working together, the entire array performs an N-
element sort in Td = O(N) and A = O(√(N lg^3 N)). The total
area increases to A = O(N lg N) if the shift registers are made
of logic nodes rather than I/O memories, in order to decrease
the circuit's unreasonably large I/O bandwidth.
According to the model of this paper, this approach is highly
nonoptimal in an AT^2 sense. It is no faster, but much larger,
than the O(lg^2 N)-processor bitonic sorting design.
G. (N/2)-Comparator Bubble Sort
As noted by a number of researchers, the bubble sorting
algorithm can be fully parallelized on a linear array of N/2
bit-serial (or pipelined) comparison-exchange modules [3],
[6], [8], [12], [15], [24], [28], [29]. Each module performs
the following simple computation: Of the two data elements
it receives from its left- and right-hand neighbors, it sends the
smaller to the left and the larger to the right. The array can be
initialized in parallel with zeros, then serially loaded with N
data elements through the leftmost module. If it is then
"flushed out" by loading maximal elements through the leftmost
module, the N data elements will emerge from the leftmost
module in O(N) comparison times.
The area of the N/2-comparator bubble sorter is A = O(N
lg N), since each comparator occupies O(lg N) area. When
bit-serial modules are used, each comparison takes O(lg N)
time, so Td = O(N lg N). Word-parallel modules improve the
bubble-sorter's delay to Td = O(N). Even so, its AT^2 performance
remains dismal. According to the AT^2 = Ω(N^2 lg N)
lower bound, a sorter with O(N lg N) processing area should
sort in about Td = O(√N) time.
There are at least three other ways of sorting N numbers on
O(N) processors with similar area-time performance figures.
Heapsort can be run on a balanced binary tree of N bit-serial
processors [26, pp. 297-299]. This tree structure can also be
used to perform the broadcast operations required in recently
proposed implementations of enumeration sort [49] and radix
sort [10], [47] on N processors. If built from bit-serial processors,
these designs would have A = O(N lg N), Td = O(N
lg N). It might be possible to use pipelined comparators and
word-parallel communication lines efficiently in these designs,
although it is not obvious how this can be done. In any event,
the AT^2 performance of these designs is intrinsically cubic
(instead of quadratic) in N, making them highly nonoptimal
for large sorting problems.
H. N-Processor Bitonic Sort on Mesh
The bitonic sort can be adapted to run very efficiently on N
bit-serial processors connected in a square mesh [31], [42].
Word-parallel connections are used in the mesh in order to
speed up the movement of data over long distances.
The operation of this algorithm is rather complicated and
will not be explained here. It is sufficient to know that the O(N
lg^2 N) comparison-exchanges in the bitonic sort require a total
of O(lg^3 N) of the N processors' time. However, it can take as
much as O(√N) time to rearrange the data among the processors
in preparation for the next comparison-exchange step.
Fortunately, only a few of the comparison-exchange operations
take this amount of time, so that the total time to sort N elements
is only Td = O(√N).
To achieve the time bound asserted above, it is necessary to
move words of data from one processor to the next in O(1)
time. This is a little difficult to arrange, since the wires between
neighboring processors are O(lg N) units long. According to
the model's loading rules [see Assumption 1f)], Td = O(lg lg
N) time is required to amplify a signal for a wire of this length.
Once a signal has been applied, it can be received, gated, and
retransmitted in Td = O(1) time by O(lg N) identical, unit-
sized nodes working in concert.
The total area of the design is A = O(N lg^2 N). Note that
the N processors take up only O(N lg N) area. The word-
parallel data paths (and their high-power drivers) require more
room than the processors themselves, in the asymptotic
limit.
(Batcher's odd-even sorting algorithm [21] or a merge sort
algorithm [41] may have constant factor advantages for some
N. Aside from these constant factors, both algorithms can be
implemented on the mesh with the same AT^2 performance as
that derived above for Batcher's bitonic sort.)
I. N-Processor Bitonic Sort on Shuffle-Exchange Net
Stone notes that the bitonic sort is easily adapted to run on
N bit-serial processors interconnected in the shuffle-exchange
pattern [39]. If bit-serial interconnections are used, the O(N
lg^2 N) comparison-exchanges in the bitonic sort take a total
of Td = O(lg^3 N) time.
Given that this design sorts so quickly, it should not be
surprising that it cannot be laid out in a small amount of area.
Indeed, it is possible to prove that the smallest layout for the
shuffle-exchange graph occupies Ω(N^2/lg^2 N) area [42]. An
embedding of this size has been recently obtained [18], [23].
This embedding can be "stretched" in the vertical direction
in O(lg N) places to make room for N bit-serial processors,
each occupying an O(1)-by-O(lg N) rectangle. The modified
embedding has area A = O(N^2/lg^2 N), and its longest wires
are of length O(N/lg N).
The AT^2 performance of this design is a little suboptimal.
One might attempt to improve it by adding parallelism to the
interprocessor communication lines. If, for example, k wires
were laid down for every edge in the shuffle-exchange graph,
the resulting network could conceivably sort k times faster.
Unfortunately, network area would increase by a factor of k^2,
leaving the AT^2 performance figure unchanged.
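The invariance claimed here follows in one line. Writing A and T for the area and sorting time of the single-wire network, and A' and T' (symbols introduced here for illustration) for the k-wire variant under the idealized assumption of a full k-fold speedup:

```latex
A' = k^{2} A, \qquad T' = \frac{T}{k}
\quad\Longrightarrow\quad
A' \, T'^{2} = k^{2} A \cdot \frac{T^{2}}{k^{2}} = A \, T^{2}.
```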
(To achieve the factor of k speedup alluded to in the previous
paragraph, the processors would have to be fitted with parallel
comparison-exchange modules. More significantly, the size
of each processor would have to be increased, so that it could
drive its output wires with only O(lg N)/k delay. Thus, it
seems that the maximum-possible speedup factor of k = O(lg
N) would be very difficult to achieve in a layout O(lg^2 N) times
as large: there would only be O(N) area available for each of
the N processors; each of the O(lg N) gates in each processor
could be implemented with at most O(N/lg N) unit-sized
nodes; the longest wires in a Leighton-style layout would have
length O(N); the long-wire driver delay would be O(lg lg N);
the time per comparison-exchange would be dominated by
these long-wire delays; and the resulting circuit would sort in
Td = O(lg^2 N lg lg N) instead of O(lg^2 N). The problem with
this construction is that no one knows how, or whether it is
possible, to lay out the N-node shuffle-exchange graph using
edges of length at most O(N/lg^2 N) [23].)
J. N-Processor Bitonic Sort on the PCCC
Preparata and Vuillemin [33] have shown that their
"cube-connected cycles" (CCC) interconnection pattern can
run the bitonic sort algorithm as efficiently as the shuffle-
exchange pattern; A = O(N^2/lg^2 N) and Td = O(lg^3 N). Their
network has the advantage of having a simple, asymptotically
optimal, layout; the asymptotically optimal layout for the
shuffle-exchange graph is much less uniform. On the other
hand, the bitonic sort algorithm for the CCC is somewhat more
complicated than the bitonic sort on the shuffle-exchange.
Quite recently, the CCC design has been improved to
achieve AT^2 optimality over a wide range of areas and times.
The new construction is called the "pleated cube-connected
cycles" (PCCC) [4]. The idea behind the construction is to
provide more "short" wires for the shorter-distance operations
of the bitonic sort algorithm (see Section III-D). The less-
frequent long-distance operations are performed more slowly
over fewer "long" wires. As the average wire length of a PCCC
increases, its area increases but its sorting time decreases. The
result is a family of networks whose areas range from A = O(N
lg N) to A = O(N^2/lg^4 N). Each network sorts in AT^2-optimal
time: from Td = O(√(N lg N)) to Td = O(lg^3 N).
The times quoted for the PCCC are for bit-serial implementations.
Conceivably, parallel PCCC networks could be
devised to run a bitonic sort in A = O(N^2/lg^2 N) and Td =
O(lg^2 N). Any such construction would be complicated by the
problem of wire delays that was encountered in the attempt
in Section III-I to speed up the shuffle-exchange network.
(Another approach to improving the CCC's performance
is suggested by Reif and Valiant [34]. They have devised a
probabilistic algorithm that would run in Td = O(lg^2 N) time
on the A = O(N^2/lg^2 N) CCC, if each processor had access
to a random number generator. It would be interesting to know
if this time performance is achievable by a deterministic algorithm,
in keeping with this paper's model of computation.)
K. (N lg^2 N)-Comparator Bitonic Sort
Batcher's bitonic sorting network [19, p. 237] can be laid
out explicitly on a VLSI chip. Each of the (1/2)(lg^2 N + lg N)
parallel comparison-exchange operations is implemented by
a row of N/2 bit-serial comparison-exchange modules. The
bit-serial interconnections between the rows of comparators
require more room than the comparators themselves, at least
asymptotically. The wiring in front of the comparators doing
a "distance-(N/2^(k+1)) operation" takes up an O(N/2^(k+1)) by
O(N) area of the chip. (As in Section III-D, 0 ≤ k < lg N.)
The total area occupied by the network is thus Σ_k (k + 1)
(N^2/2^(k+1)) = O(N^2), since there are k + 1 distance-(N/2^(k+1))
comparison-exchange operations.
The total delay through the network is Td = O(lg^3 N), for
each of the O(lg^2 N) rows adds O(lg N) delay. Since there are
no feedback paths, the network is easily pipelined with a concurrency
of O(lg^2 N). The area performance is the total area
divided by the concurrency, A = O(N^2/lg^2 N).
An improvement can be made to the construction outlined
above, leading to a better AT^2 performance. There is no need
for multiple stages of amplification at the outputs of the bit-
serial comparators if the comparators themselves are "scaled
up" to match the length of their output wires. One such comparator
for a distance-(N/2^(k+1)) operation occupies O(N/
2^(k+1)) area. (An O(lg N)-stage driver is needed at the output
of each distance-(1) comparator to amplify its signal for the
following scaled-up comparator.) The total area is still O(N^2):
the comparators now take up space asymptotically equal to
that of the long-wire drivers in the original construction.
Sorting delay is decreased since bits travel from one row of
comparators to the next in Td = O(1) time. The improved
network has total delay Td = O(lg^2 N), concurrency O(lg N),
and area performance A = O(N^2/lg N).
Pratt has pointed out that shellsort [19, p. 84] can be implemented
on either (lg^2 N) or (N lg^2 N) processors with the
same AT^2 performance as the bitonic sort.
As this article was going to press, an O(N lg N)-comparator,
O(lg N)-depth sorting network was reported [1]. Implemented
in a fully parallel fashion, it could be built of O(lg N) rows of
N/2 comparators each. The connections between the rows are
"expander graphs" with an average wire length of O(N). Total
area is O(N^2 lg N). Sorting delay need only be Td = O(lg N)
if each comparator is "scaled-up" by a factor of N. Thus, AT^2
= O(N^2 lg^3 N), an asymptotic improvement over the O(N lg^2
N)-comparator bitonic sorter. This asymptotic improvement
is nonetheless a constant-factor disaster, for there are an astronomically
large number of rows in the currently proposed
"O(lg N)-depth" construction.
L. N^2-Comparator Bubble Sort
A final attempt can be made to optimize the bubble sort for
VLSI, providing a different comparison-exchange module for
each of the N^2 comparisons in a bubble sort on N elements [19,
p. 224]. The resulting network is not very impressive. If built
from bit-serial comparators, it occupies O(N^2) area. Total
delay through the network is Td = O(N), and N/lg N problem
instances will fit in it at any given time. Its area performance
is its total area divided by its concurrency, A = O(N lg N).
Note that the same time performance can be obtained in a
small fraction of this area with the lg^2 N-processor bitonic
sorting design.
When built of pipelined comparison-exchange modules, the
N^2-processor bubble sorter occupies a total of N^2 lg^2 N area.
Its concurrency increases to about N lg N, giving it the same
area performance as before, A = O(N lg N). Strangely
enough, its time performance worsens, becoming Td = O(N
lg N). The reason for this anomalous behavior is that the bit-
serial implementation is already fully pipelined.
M. (N^2)-Processor Rank Sort
Consider a square array of N^2 processors, interconnected
in the following peculiar way. The N processors in each row
are the leaves of a balanced binary tree; the internal vertices
of the "row trees" provide bit-serial communication paths
between the root of the tree and its N "leaf" processors. Similarly,
a "column tree" provides connections between the N
processors in each column of the array. Each processor is thus
a leaf vertex in two orthogonal trees.
This network has been called by various names, including
the "orthogonal tree network" [32] and the "mesh of trees"
[22], [23]. Fig. 7 illustrates it for the case N = 4.
A brute-force sorting algorithm can be implemented on this
network, as pointed out by the authors cited above. (Muller
and Preparata [30] describe this algorithm without reference
to the natural shape of the orthogonal tree network.) Each of
the N inputs to a sorting problem can be presented to one of
the root vertices of a row tree. The inputs are then broadcast
to the leaves of the tree, so that each processor in a row has a
copy of that row's input. Next, the column trees come into play:
the jth leaf processor of the jth column tree sends a copy of its
input to its root. This value is "broadcast" downwards through
the column trees, so that processor (i,j) now contains copies
of two input values, input[i] and input[j].
Fig. 7. The orthogonal tree network for N = 4.
TABLE I
AREA-TIME BOUNDS FOR THE SORTING PROBLEM
Design                               Area Perf. (A)     Time Perf. (Td)              AT^2
Lower bound                          -                  -                            Ω(N^2 lg N)
1. Uniprocessor (s.)                 lg N               N lg^2 N                     N^2 lg^5 N
2. lg N-proc. heapsort (s.)          lg^2 N             N lg N                       N^2 lg^4 N
3. lg N-proc. mergesort (s.)         lg^2 N             N lg N                       N^2 lg^4 N
4. lg N-proc. bitonic (p.)           lg^2 N             N lg N                       N^2 lg^4 N
5. lg^2 N-proc. bitonic (p.)         lg^3 N             N                            N^2 lg^3 N
6. √(N lg N)-proc. bitonic (p.)      N lg N             N                            N^3 lg N
7. N/2-comp. bubble (p.)             N lg N             N                            N^3 lg N
8. N-proc. bitonic, mesh (s.)        N lg^2 N           √N                           N^2 lg^2 N
9. N-proc. bitonic, S-E (s.)         N^2/lg^2 N         lg^3 N                       N^2 lg^4 N
10. N-proc. bitonic, PCCC (s.)       N^2 lg^2 N/Td^2    lg^3 N ≤ Td ≤ √(N lg N)      N^2 lg^2 N
11. N lg^2 N-comp. bitonic (p.)      N^2/lg N           lg^2 N                       N^2 lg^3 N
12. N^2-comp. bubble (s.)            N lg N             N                            N^3 lg N
13. N^2-proc. rank sort (s.)         N^2 lg^2 N         lg N                         N^2 lg^4 N
The next step in the sorting algorithm is to compute the
ranks of the inputs. The ith row tree evaluates the rank of input
i by "summing" the results of comparing input[i] with input[j].
To be more specific, processor (i,j) compares its two
input values, sending a "1" up through its row tree if input[i]
> input[j] or if ((input[i] = input[j]) and (i > j)). These
values are summed by the row trees. A moment's reflection
should convince the reader that the sum of the values in row
i is the rank of input[i], with ties being broken by the i > j
calculation. The root of each row tree will have a different
integer from the range of possible ranks, [0, N - 1].
The input ranks are next broadcast to the leaf vertices of the
row trees. Finally, processor (i,j) sends up ("selects") the value
of input[i] through its column tree if rank[i] = j. The sorted
values are now available at the roots of the column trees.
It is a straightforward exercise to verify that the internal
vertices can perform the broadcast, summation, and selection
operations in a bit-serial fashion with O(1) bits of storage and
O(1) logic devices.
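The rank computation and the final selection step can be sketched sequentially in Python. This is a functional illustration only: the tree broadcasts and bit-serial summations are replaced by ordinary loops, and the function name `rank_sort` is invented for illustration.

```python
def rank_sort(inputs):
    """Return (ranks, sorted_values) using the comparison rule of processor (i, j)."""
    n = len(inputs)
    # Row i sums a "1" from every processor (i, j) whose comparison succeeds.
    ranks = [
        sum(
            1
            for j in range(n)
            if inputs[i] > inputs[j] or (inputs[i] == inputs[j] and i > j)
        )
        for i in range(n)
    ]
    # Column j "selects" input[i] from the processor (i, j) with rank[i] = j.
    output = [None] * n
    for i in range(n):
        output[ranks[i]] = inputs[i]
    return ranks, output


print(rank_sort([3, 1, 3, 2]))  # ([2, 0, 3, 1], [1, 2, 3, 3])
```

The two equal inputs receive ranks 2 and 3, which shows how the i > j test keeps the ranks distinct.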
It remains to establish the area and time complexity of this
sorting procedure. Since broadcast, summation, and selection
operations are performed on trees with $N$ leaves, the best
possible time performance is $T_d = O(\lg N)$. Also, as proved
in [23], the minimum possible area for the orthogonal tree
network is $O(N^2 \lg^2 N)$. These performance figures are in
fact achievable, but only with careful design.
Observe that in Fig. 7 the wires connecting vertices in the
row and column trees are not all of the same length. The closer
one gets to the root, the longer the wires; the wires double in
length from one level of the tree to the next. Thus, each logic
device in an internal vertex on level $k$ should be built of
$O((N \lg N)/2^k)$ unit-sized nodes, enabling the vertex to drive
its $O((N \lg N)/2^k)$-length wires at unit delay, as required.
Since there are $O(1)$ logic devices in an internal vertex, and
since there are $2^k$ vertices at level $k$, each level fits into
the $O(1)$-by-$O(N \lg N)$ rectangle available to it in the
"obvious" $O(N^2 \lg^2 N)$-area layout suggested by Fig. 7.
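
The per-level accounting in this argument can be written out
explicitly. The following is a sketch under the text's own
assumptions (a constant number of logic devices per internal
vertex, each built from $O((N \lg N)/2^k)$ unit-sized nodes at
level $k$); the underbrace labels are added here for illustration.

```latex
\[
\underbrace{2^{k}}_{\text{vertices at level } k}
\;\times\;
\underbrace{O(1)}_{\text{devices per vertex}}
\;\times\;
\underbrace{O\!\left(\frac{N \lg N}{2^{k}}\right)}_{\text{nodes per device}}
\;=\; O(N \lg N)
\]
```

The factor $2^k$ cancels, so the area needed by one level of a
tree does not depend on $k$.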
IV. COMPARISON OF THE DESIGNS
The area and time performance of the thirteen sorting circuits
is summarized in Table I. For easy reference, the designs are
numbered from 1 to 13 in correspondence with Sections III-A
through III-M, in which they are discussed. The "s" or "p" in a
design's name indicates whether it is implemented with serial or
pipelined comparators. The "area performance" column gives each
design's processing area divided by its concurrency, in
accordance with the definition in Section II. This metric is an
indication of the power consumed per sorting problem. A design's
"time performance," as formally defined in Section II, can be
characterized as the elapsed time between the first input to the
circuit and the last output from the chip for each sorting
problem.
Table I shows that nearly all the designs considered in this
paper are within a factor of $O(\lg^k N)$ of being optimal in an
$AT^2$ sense. The sole exceptions are the bubble sorters and the
$(\sqrt{N \lg N})$-processor bitonic sorter.
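
To make "within a polylogarithmic factor of optimal" concrete,
the following minimal sketch evaluates the ratio of a design's
$AT^2$ figure to the $N^2 \lg N$ lower bound at one problem size.
It is an illustration only: the function name at2_ratio is
hypothetical, all hidden constants are taken to be 1, and the
area and time expressions are the asymptotic forms used in this
paper (the rank sorter from Section III-M, and the
(N/2)-comparator bubble sorter as read from Table I).

```python
import math


def at2_ratio(area, time, n):
    """Return A * T^2 divided by the N^2 lg N lower bound (constants taken as 1)."""
    lower_bound = n * n * math.log2(n)
    return area * time * time / lower_bound


n = 2 ** 20
lg = math.log2(n)

# Rank sorter: area N^2 lg^2 N, time lg N.
rank_sorter = at2_ratio(n * n * lg * lg, lg, n)
# (N/2)-comparator bubble sorter, as read from Table I: area N lg N, time N.
bubble_sorter = at2_ratio(n * lg, n, n)

print(rank_sorter)    # 8000.0, i.e. lg^3 N
print(bubble_sorter)  # 1048576.0, i.e. N
```

Under these assumptions the rank sorter exceeds the bound by a
polylogarithmic factor, whereas the bubble sorter exceeds it by
a factor that grows linearly in $N$.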
Table II contains additional information about the sorters.
(The PCCC design appears twice in this table, on lines 10a and
10b, in its slowest and fastest forms. Intermediate figures are
possible: see Section III-J.) The first entry for each design
shows its concurrency, defined as the number of sorting problems
that should be solved simultaneously to achieve the maximum
possible area performance. The second and third entries indicate
the total "processing" and "memory" area required in a system
containing that design. (One bit of RAM or shift-register memory
occupies one unit of area.) The final entry is the design's
processor-memory bandwidth in bits/(unit time). This entry is
also equal to the number of I/O ports required on the design's
"processing" chip, since all the designs keep all their ports
busy essentially all the time.
Thus far in the paper, $A_{\mathrm{memory}}$ and I/O bandwidth
have not been considered explicitly. Furthermore,
$A_{\mathrm{processing}}$ has largely been ignored in deference
to the theoretically "correct" area performance measure
$A = A_{\mathrm{processing}}/c$. From a systems perspective,
however, the measures of Table II are more important than those
of Table I. A design will occupy circuit board area proportional
to $A_{\mathrm{processing}} + A_{\mathrm{memory}}$, and its "processing"
TABLE II
OTHER PERFORMANCE MEASURES
Design                                   Concurrency   $A_{\mathrm{processing}}$   $A_{\mathrm{memory}}$   I/O B.W.
1. Uniprocessor                          $1$           $\lg N$                     $N \lg N$               $1$
2. $(\lg N)$-proc. heapsort              $1$           $\lg^2 N$                   $N \lg N$               $\lg N$
3. $(\lg N)$-proc. mergesort             $1$           $\lg^2 N$                   $N \lg N$               $\lg N$
4. $(\lg N)$-proc. bitonic               $1$           $\lg^2 N$                   $N \lg N$               $\lg^2 N$
5. $(\lg^2 N)$-proc. bitonic             $2$           $\lg^3 N$                   $N \lg N$               $\lg^3 N$
6. $(\sqrt{N \lg N})$-proc. bitonic      $1$           $N \lg N$                   $N \lg N$               $\lg N$
7. $(N/2)$-proc. bubble                  $1$           $N \lg N$                   $N \lg N$               $\lg N$
8. $N$-proc. bitonic, mesh               $1$           $N \lg^2 N$                 $N \lg N$               $\sqrt{N} \lg N$
9. $N$-proc. bitonic, S-E                $1$           $N^2/\lg^2 N$               $N \lg N$               $N/\lg^2 N$
10a. $N$-proc. bitonic, PCCC             $1$           $N \lg N$                   $N \lg N$               $\sqrt{N \lg N}$
10b. $N$-proc. bitonic, PCCC             $1$           $N^2/\lg^4 N$               $N \lg N$               $N/\lg^2 N$
11. $(N \lg^2 N)$-proc. bitonic          $\lg N$       $N^2$                       $N \lg^2 N$             $N$
12. $N^2$-proc. bubble                   $N/\lg N$     $N^2$                       $N^2$                   $N$
13. $N^2$-proc. rank sort                $1$           $N^2 \lg^2 N$               $N \lg N$               $N$
portion will have to have enough pins to support its I/O
bandwidth.
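
As a worked instance of the relation
$A = A_{\mathrm{processing}}/c$, consider the
$(N \lg^2 N)$-processor bitonic sorter, which the discussion
below describes as solving $\lg N$ problems at a time. Taking its
processing area to be $N^2$, as read from Table II (an assumption
here, since only the concurrency is stated in the prose), the
area performance works out as follows.

```latex
\[
A \;=\; \frac{A_{\mathrm{processing}}}{c}
  \;=\; \frac{N^{2}}{\lg N}
\]
```

A high concurrency thus lowers the area charged to each sorting
problem, even though the circuit board must still hold the full
$A_{\mathrm{processing}} + A_{\mathrm{memory}}$.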
Of course, a sorting circuit should not be selected just because
it is asymptotically optimal. A digital engineer is interested
only in actual speeds and sizes. Although the model of
computation of this paper is not exact enough to permit such
analyses, some statements can be made about the relative sizes
and speeds of the designs.
The smallest design is clearly the $O(\lg N)$-area uniprocessor.
Somewhat surprisingly, this design is nearly $AT_d^2$ optimal if
it is programmed to use any of the $O(N \lg N)$-step serial
algorithms.
If more sorting speed is desired, the $(\lg N)$-processor
heapsort design becomes attractive. It requires almost exactly
$\lg N$ times as much area as the uniprocessor design, since the
processors and programs for the two designs are very similar.
The design has the smallest possible delay of any sorter that
receives its inputs in a single bit-serial stream, since the first
output is available immediately after the last input has been
received. (The $(N/2)$-processor bubble sorter is the only other
design considered in this paper that has this property. All
others introduce at least an $O(\lg N)$ delay between the last
input time and the first output time.)
A major drawback of the $(\lg N)$-processor heapsorter is that
it requires $\lg N$ independently addressable memories, one for
each processor. The total memory-processor bandwidth increases
proportionately (see Table II) to $\lg N$ bits per time unit.
The $(\lg N)$-processor bitonic design has the same area and
time performance as the $(\lg N)$-processor heapsort design.
The former has the advantage of a slightly simpler control
algorithm, and it uses the simpler shift-register type of I/O
memory; the latter uses a more efficient sorting algorithm and
hence less memory bandwidth.
The $(\lg^2 N)$-processor bitonic sorter is smaller than either
of the $(\lg N)$-processor designs, for moderately sized $N$. Its
control algorithm is extremely simple, so that a "processor"
is not much more than a comparison-exchange module. Its
major drawback is that it makes continuous use of
$(1/2)(\lg N)(\lg N + 1)$ word-parallel shift-register memories,
of various sizes.
The $(\sqrt{N \lg N})$-processor bitonic sorter has been entered
in Table II with a total area of $O(N \lg N)$, so that there is
room on the chip for all of its temporary storage registers.
Otherwise, it would require $\sqrt{N \lg N}$ separate I/O
memories. It has the same speed and a somewhat better I/O
bandwidth than the $(\lg^2 N)$-processor bitonic sorter just
discussed. However, the latter's shift registers could also be
placed on the same chip as its processing circuitry, equalizing
the I/O bandwidth for the two designs. When "constant factors"
are taken into consideration, the $(\sqrt{N \lg N})$-processor
design is clearly much larger than the $(\lg^2 N)$-processor
design, because it has more processors and a much more
complicated control algorithm.
The $(N/2)$-processor bubble sorter has a couple of significant
advantages that are not revealed in either Table I or Table II.
Its comparators need very little in the way of control hardware,
so that at least for small $N$, it occupies less area than any of
the preceding designs. Also, it can be used as a "self-sorting
memory," performing insertions and deletions on-line. (The
uniprocessor and the $(\lg N)$-processor heapsorter can also be
used in this fashion.) However, for even moderately sized $N$,
the bubble sorter's horrible $AT_d^2$ performance becomes
noticeable. For example, when $N = 256$, the $(\lg^2 N)$-processor
bitonic sorter's 36 comparators and 491 words of storage
probably occupy less room than the 128 comparators in a
bubble sorter. Nonetheless, the bubble sorter always maintains
about a 2:1 delay advantage over the $(\lg^2 N)$-processor
bitonic sorter, when similar comparators are used.
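
The two comparator counts in this example can be checked
directly. The sketch below assumes that the $(\lg^2 N)$-processor
design uses one comparator per shift-register memory, so that the
count $(1/2)(\lg N)(\lg N + 1)$ quoted earlier for the memories
also gives the number of comparators; the function names are
hypothetical. It does not attempt to reproduce the 491-word
storage figure.

```python
def shift_register_count(n):
    """Return (1/2)(lg N)(lg N + 1) for n a power of two."""
    lg = n.bit_length() - 1  # exact lg N when n is a power of two
    return lg * (lg + 1) // 2


def bubble_comparator_count(n):
    """Return N/2, the comparator count of the (N/2)-processor bubble sorter."""
    return n // 2


print(shift_register_count(256))     # 36
print(bubble_comparator_count(256))  # 128
```

The first count grows only as $\lg^2 N$ while the second grows
linearly, which is why the comparison tips further toward the
bitonic design as $N$ increases.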
The $N$-processor mesh-type bitonic sorter is the first design
to solve a sorting problem in sublinear time. Unfortunately,
it occupies a lot of area. Each of its processors must run a
complicated sorting algorithm, reshuffling the data among
themselves after every comparison-exchange operation. Its I/O
bandwidth must also be large, since it solves sorting problems
so rapidly. However, constant factor improvements may be
made to its area and bandwidth figures by reprogramming the
processors so that each handles several data elements at a time.
Also, large area and bandwidth are not always significant
problems: in an existing mesh-connected multiprocessor, the
$N$ processors are already in place and the I/O data may be
produced and consumed by local application routines.
The next three designs in Tables I and II are variants on a
fully parallelized bitonic sort. The shuffle-exchange sorter has
a slight area advantage over the CCC or PCCC sorter because
of its simpler control algorithm. However, the CCC and PCCC
are somewhat more regular interconnection patterns, so that
they may be easier to wire up in practice. Both designs are
smaller in total asymptotic area than the $(N \lg^2 N)$-processor
bitonic sorter, which solves $\lg N$ problems at a time.
Nonetheless, the control structure of this last design is so
simple that, as a rough guess, it takes less area than the others
for all $N < 2^{20}$. (Of course, if a shuffle-exchange or PCCC
processor has already been built, the additional area cost for
programming the sorting algorithm is very small.)
There seems to be little to recommend the $N^2$-processor
bubble sorter. It has the same I/O bandwidth, a bit more total
area, and a much worse time performance than the
$(N \lg^2 N)$-processor bitonic sorter.
Finally, the $N^2$-processor rank sorter can be characterized
as being larger but not all that much faster than the $N$- and
$(N \lg^2 N)$-processor bitonic sorters. Its chief interest is
theoretical: it sorts in a minimal number, $O(\lg N)$, of gate
delays. No other sorting circuit of equivalent time performance
could possibly beat its area performance by more than a
logarithmic factor or two, considering the theoretical limit of
$AT_d^2 = \Omega(N^2 \lg N)$. However, it remains an open
question whether it is possible to build a $T = O(\lg N)$ sorter
that occupies even a little less area.
V. CLOSING REMARKS
At the time of this writing, there are a number of important
open questions in VLSI complexity theory. A simply stated but
seemingly perplexing problem is to find out how much area can
be saved when additional "layers" of wiring are made available
by technological advances. It is known that a $k$-level
embedding can be no smaller than $1/k^2$ of the area of a
two-level embedding [42, pp. 36-38], but it is not known whether
this bound is achievable. (Some results on $k$-level embedding
have been obtained recently [35].)
A second problem is to derive matching upper and lower
bounds for the area $\times$ time$^2$ complexity of the sorting
problem. The best upper bound is $AT_d^2 = O(N^2 \lg^2 N)$,
achieved by the $N$-processor bitonic sort on a mesh and by the
PCCC. As discussed in Section I, the best lower bound is
$AT_p^2 = \Omega(N^2 \lg N)$ [44], which leaves a gap of
$O(\lg N)$. The gap can be closed by adding the assumption that
all $\lg N$ bits of each input value are read in through a single
I/O port [42]. (The current model allows the bits of each input
value to be read in through different ports.) It seems probable
that the $AT_p^2 = \Omega(N^2 \lg^2 N)$ result for word-oriented,
"localized" I/O can be extended to handle the less restrictive
model of this paper. On the other hand, it is conceivable that
such a bound is impossible because of the existence of some
yet-to-be-discovered sorting circuit with an $AT_d^2$ performance
better than that of the bitonic sort on the mesh or PCCC.
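
The size of the gap follows directly from the two bounds quoted
in this paragraph. As a short check, ignoring constant factors:

```latex
\[
\frac{N^{2}\lg^{2} N}{N^{2}\lg N} \;=\; \lg N
\]
```

A lower bound of $\Omega(N^2 \lg^2 N)$, if it could be proved in
the present model, would therefore match the upper bound to
within a constant factor.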
A third problem is to evaluate separately the VLSI com-
plexity of the "ranking" and "permutation" subproblems of
the sorting problem, as discussed in Section I. Current lower
bounds for the sorting problem can be viewed as arguments
about the data flow required in a permuter; they seem to say
little about the problem of ranking the data. Lower bounds on
ranking would presumably be more subtle and give more in-
sight into how "control information" must flow in a sorting
circuit.
Another set of problems is opened up by the fact that the
area $\times$ time$^2$ bounds are affected greatly by
nondeterministic, stochastic, or probabilistic assumptions in the
model. My intuition is that the VLSI complexity of sorting is not
sensitive
to such assumptions, but it would be nice to be sure of this.
Counterintuitive results have already been proved: equality
testing is very easy if the answer need only be "probably"
correct [25], [48].
A final and very important problem in VLSI theory is the
development of a stable model. Currently there are almost as
many models as papers. If this trend continues, results in the
area will become difficult to report and describe. However, it
is far from settled whether wire delays should be treated as
being linear or logarithmic in wire length, and the costs of
off-chip communication remain unknown.
ACKNOWLEDGMENT
Referee 1's detailed and thought-provoking report is
gratefully acknowledged.
REFERENCES
[1] M. Ajtai, J. Komlos, and E. Szemeredi, "An O(n log n) sorting network,"
in Proc. 15th Annu. ACM Symp. Theory Comput., Apr. 1983, pp.
1-9.
[2] P. K. Armstrong, U.S. Patent 4 131 947, Dec. 26, 1978.
[3] P. K. Armstrong and M. Rem, "A serial sorting machine," Comput.
Elect. Eng., Pergamon, vol. 9, Mar. 1982.
[4] G. Bilardi, M. Pracchi, and F. P. Preparata, "A critique of network speed
in VLSI models of computation," IEEE J. Solid-State Circuits, vol.
SC-17, pp. 696-702, Aug. 1982.
[5] R. Brent and H. T. Kung, "The area-time complexity of binary multi-
plication," J. Ass. Comput. Mach., vol. 28, pp. 521-534, July 1981.
[6] T. C. Chan, K. P. Eswaren, V. Y. Lum, and C. Tung, "Simplified odd-
even sort using multiple shift-register loops," Int. J. Comput. Inform.
Sci., vol. 7, pp. 295-314, Sept. 1978.
[7] B. Chazelle and L. Monier, "Towards more realistic models of
computation for VLSI," in Proc. 11th Annu. ACM Symp. Theory Comput.,
Apr. 1979, pp. 209-213.
[8] K.-M. Chung, F. Luccio, and C. K. Wong, "On the complexity of sorting
in magnetic bubble memory systems," IEEE Trans. Comput., vol. C-29,
pp. 553-562, July 1980.
[9] A. Despain, "Very fast Fourier transform algorithms for hardware
implementation," IEEE Trans. Comput., vol. C-28, pp. 333-341, May
1979.
[10] Y. Dohi, A. Suzuki, and N. Matsui, "Hardware sorter and its application
to data base machine," in Proc. 9th Annu. Symp. Comput. Arch. (ACM
SIGARCH Newsletter), vol. 10, Apr. 1982, pp. 218-225.
[11] S. A. Evans, "Scaling I2L for VLSI," IEEE J. Solid-State Circuits,
vol. SC-14, pp. 318-326, Apr. 1979.
[12] L. J. Guibas and F. M. Liang, "Systolic stacks, queues, and counters,"
in Proc. 1982 Conf Advanced Res. VLSI, Massachusetts Inst. Technol.,
Cambridge, Jan. 1982, pp. 155-164.
[13] J.-W. Hong and H. T. Kung, "I/O complexity: The red-blue pebble
game," in Proc. 13th Annu. ACM Symp. Theory Comput., May 1981,
pp. 326-333.
[14] J.-W. Hong, "On similarity and duality of computation," Peking Munic.
Computing Center, Peking, People's Repub. China, 1981, unpub-
lished.
[15] G. Kedem, "A first in, first out and a priority queue," Dep. Comput.
Sci., Univ. Rochester, Rochester, NY, Tech. Rep. 90, Mar. 1981.
[16] Z. M. Kedem and A. Zorat, "Replication of inputs may save
computational resources in VLSI," in Proc. 22nd Symp. Found. Comput. Sci.,
IEEE Comput. Soc., Oct. 1981.
[17] M. B. Ketchen, "AC powered Josephson miniature system," in Proc.
1980 Int. Conf. Circuits Comput., IEEE Comput. Soc., Oct. 1980, pp.
874-877.
[18] D. Kleitman, F. T. Leighton, M. Lepley, and G. L. Miller, "New layouts
for the shuffle-exchange graph," in Proc. 13th Annu. ACM Symp.
Theory Comput., May 1981, pp. 334-341.
[19] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and
Searching. Reading, MA: Addison-Wesley, 1973.
[20] D. E. Knuth, "Big omicron and big omega and big theta," SIGACT News, vol.
8, pp. 18-24, Apr.-June 1976.
[21] M. Kumar and D. S. Hirschberg, "An efficient implementation of
Batcher's odd-even merge algorithm and its application in parallel
sorting schemes," IEEE Trans. Comput., vol. C-32, pp. 254-264, Mar.
1983.
[22] F. T. Leighton, "New lower bound techniques for VLSI," in Proc. 22nd
Symp. Found. Comput. Sci., IEEE Comput. Soc., Oct. 1981.
[23] F. T. Leighton, "Layouts for the shuffle-exchange graph and lower bound
techniques for VLSI," Ph.D. dissertation MIT/LCS/TR-724, M.I.T.
Lab. for Comput. Sci., Massachusetts Inst. Technol., Cambridge, June
1982.
[24] C. E. Leiserson, "Area-efficient VLSI computation," Ph.D. dissertation
CMU-CS-82-108, Comput. Sci. Dep., Carnegie-Mellon Univ., Pitts-
burgh, PA, Oct. 1981.
[25] R. J. Lipton and R. Sedgewick, "Lower bounds for VLSI," in Proc. 13th
Annu. ACM Symp. Theory Comput., May 1981, pp. 300-307.
[26] C. Mead and L. Conway, Introduction to VLSI Systems. Reading,
MA: Addison-Wesley, 1980.
[27] H. P. Moravec, "Fully interconnecting multiple computers with pipelined
sorting nets," IEEE Trans. Comput., vol. C-28, pp. 795-798, Oct.
1979.
[28] A. Mukhopadhyay and T. Ichikawa, "An n-step parallel sorting ma-
chine," Dep. Comput. Sci., Univ. Iowa, Iowa City, Tech. Rep. 72-03,
1972.
[29] A. Mukhopadhyay, "WEAVESORT-A new sorting algorithm for
VLSI," Univ. Central Florida, Orlando, Tech. Rep. 53-81, 1981.
[30] D. E. Muller and F. P. Preparata, "Bounds to complexities of networks
for sorting and for switching," J. Ass. Comput. Mach., vol. 22, pp.
195-201, Apr. 1975.
[31] D. Nassimi and S. Sahni, "Bitonic sort on a mesh-connected parallel
computer," IEEE Trans. Comput., vol. C-28, pp. 2-7, Jan. 1979.
[32] D. Nath, S. N. Maheshwari, and P. C. P. Bhatt, "Efficient VLSI
networks for parallel processing based on orthogonal trees," Dep. Elec. Eng.,
Indian Inst. Technol., New Delhi, India, 1981, unpublished.
[33] F. Preparata and J. Vuillemin, "The cube-connected cycles: A versatile
network for parallel computation," in Proc. 20th Annu. Symp. Found.
Comput. Sci., IEEE Comput. Soc., Oct. 1979, pp. 140-147.
[34] J. H. Reif and L. G. Valiant, "A logarithmic time sort for linear size
networks," in Proc. 15th Annu. ACM Symp. Theory Comput., Apr.
1983, pp. 10-17.
[35] A. L. Rosenberg, "Three-dimensional VLSI, I: A case study," in Proc.
CMU Conf. VLSI, Comput. Sci. Press, Oct. 1981, pp. 69-79.
[36] J. Savage, "Planar circuit complexity and the performance of VLSI
algorithms," in VLSI Systems and Computations, H. T. Kung, B.
Sproull, and G. Steele, Eds. Woodland Hills, CA: Comput. Sci. Press,
Oct. 1981.
[37] J. Savage, "Area-time tradeoffs for matrix multiplication and related
problems in VLSI models," J. Comput. Syst. Sci., vol. 22, pp. 230-242,
Apr. 1981.
[38] C. L. Seitz, "Self-timed VLSI systems," in Proc. Caltech Conf. VLSI,
Dep. Comput. Sci., California Inst. Technol., Pasadena, Jan. 1979, pp.
345-356.
[39] H. Stone, "Parallel processing with the perfect shuffle," IEEE Trans.
Comput., vol. C-20, pp. 153-161, Feb. 1971.
[40] Y. Tanaka, Y. Nozaka, and A. Masuyama, "Pipeline searching and
sorting modules as components of data flow database computer," in
Proc. Int. Fed. Inform. Processing, Oct. 1980, pp. 427-432.
[41] C. D. Thompson and H. T. Kung, "Sorting on a mesh-connected parallel
computer," Commun. Ass. Comput. Mach., vol. 20, pp. 263-271, Apr.
1977.
[42] C. D. Thompson, "A complexity theory for VLSI," Ph.D. dissertation
CMU-CS-80-140, Comput. Sci. Dep., Carnegie-Mellon Univ., Pitts-
burgh, PA, Aug. 1980.
[43] C. D. Thompson, "Fourier transforms in VLSI," IEEE Trans. Comput., vol. C-32,
pp. 1047-1057, Nov. 1983.
[44] C. D. Thompson and D. Angluin, "On AT^2 lower bounds for sorting,"
unpublished manuscript, May 1983.
[45] S. Todd, "Algorithm and hardware for a merge sort using multiple
processors," IBM J. Res. Develop., vol. 22, pp. 509-517, Sept. 1978.
[46] J. Vuillemin, "A combinational limit to the computing power of VLSI
circuits," IEEE Trans. Comput., vol. C-32, pp. 294-300, Mar. 1983.
[47] L. E. Winslow and Y.-C. Chow, "Parallel sorting machines: Their speed
and efficiency," in Proc. AFIPS 1981 Nat. Comput. Conf., Fall 1981,
pp. 163-165.
[48] A. C. Yao, "Some complexity questions related to distributive
computing," in Proc. 11th Annu. ACM Symp. Theory Comput., May 1979,
pp. 209-213.
[49] H. Yasuura, N. Takagi, and S. Yajima, "The parallel enumeration
sorting scheme for VLSI," Dep. Inform. Sci., Kyoto University, Kyoto,
Japan, Yajima Lab. Res. Rep. ER-81-03, Apr. 1982.
Clark D. Thompson, for a photograph and biography, see p. 1057 of the
November issue of this TRANSACTIONS.