Improving the average delay of sorting  by Jakoby, Andreas et al.
Theoretical Computer Science 410 (2009) 1030–1041
Improving the average delay of sorting
Andreas Jakoby a,∗, Maciej Liśkiewicz a, Rüdiger Reischuk a, Christian Schindelhauer b
a Inst. für Theoretische Informatik, Universität zu Lübeck, Germany
b Inst. für Informatik, Universität Freiburg, Germany
Abstract
In previous work we have introduced an average case measure for the time complexity of
Boolean circuits. Instead of fixed circuit depth, for each input we take the minimal number
of time steps necessary to perform the computation for that particular input using gates
that forward their output values as soon as possible. This measure is called delay. Based on
it, the complexity of a whole class of functions that can be described as prefix computations
has been analysed in detail.
Here we consider the problem of sorting large integers that are given in binary notation.
Contrary to a word comparator sorting circuit C where a basic computational element, a
comparator, is charged with a single time step to compare two elements, in a bit comparator
circuit C ′ a comparison of two binary numbers has to be implemented by a Boolean
subcircuit CM called comparator module that is built from Boolean gates of bounded fanin.
Thus, compared to C, the depth of C ′ will be larger by a factor up to the depth of CM.
Our goal is to minimize the average delay of bit comparator sorting circuits. The worst-
case delay can be estimated by the depth of the circuit. For this worst-case measure two
topologically quite different designs seem to be appropriate for the comparator modules: a
tree-like one if the inputs are long numbers, otherwise a linear array working in a pipelined
fashion. Inserting these into a word comparator circuit we get bit level sorting circuits for
binary numbers of length m, for which the depth is either increased by a multiplicative
factor of order log m or by an additive term of order m.
We show that these obvious solutions can be improved significantly by constructing
efficient sorting and merging circuits for the bit model that only suffer a constant factor
time loss on the average if the inputs are uniformly distributed. This is done by designing
suitable hybrid architectures of tree compaction and pipelining. These results can also be
extended to classes of nonuniform distributions if we put a bound on the complexity of the
distributions themselves.
© 2009 Published by Elsevier B.V.
1. Introduction
1.1. Average case analysis for circuits
For circuits, depth is normally used to measure the time a computation takes. This is a worst case estimation. In [5] we
have defined an average case measure for the time complexity of circuits called delay. It has been observed that in many
cases critical paths of a given circuit, e.g. paths between input and output gates of maximal length, have no influence on the
final output. Hence, the output values of the circuit for such inputs can be obtained much earlier.
Fig. 1. A sorting circuit C3 for 3 bits and the flow of bits for the input vector (1, 0, 0).
Supported by DFG research grant Re 672/3.
∗ Corresponding author.
E-mail addresses: jakoby@tcs.mu-luebeck.de (A. Jakoby), liskiewi@tcs.mu-luebeck.de (M. Liśkiewicz), reischuk@tcs.mu-luebeck.de (R. Reischuk),
schindel@informatik.uni-freiburg.de (C. Schindelhauer).
doi:10.1016/j.tcs.2008.10.028
The average delay of basic functions like OR, ADDITION, PARITY, and THRESHOLD has been estimated precisely. These
are special instances of the parallel prefix problem that have been investigated in detail in [4]. In many cases we have found
circuit designs that are exponentially faster on average than the optimal circuits for the worst-case [5,7,6]. On the other
hand, we could show lower bounds saying that for certain functions, e.g. PARITY, the average delay remains asymptotically
the same as in the worst case. A similar result holds for the problem of sorting n bits, which has worst-case complexity Θ(log n).
For the worst case, the lower bound follows from a simple counting argument, the upper bound has been established by a
nontrivial construction of Ajtai, Komlós and Szemerédi [1].
The delay of a sorting circuit may be smaller than its depth as can be seen in Fig. 1 showing a sorting circuit C3 for 3
elements. The first picture shows the circuit consisting of 3 comparators A, B, C . Its depth is 3, too, since the line in the
middle marked with input y goes through each comparator. The pictures show the flow of the inputs through the circuit
starting with time t = 0 when all inputs are at the left end. However, for the given input vector (1, 0, 0), a delay of 3 does not
occur on the critical path in the middle. The reason is as follows: already in the first time step the lower 0 can
be passed through comparator B to its upper output line although the second input for B has not arrived yet. No matter
what kind of bit this will be, comparator B can be set to an X that switches the inputs because we can be sure that a 0 must
occur at the upper output. In the second phase this allows comparator C to do its job since both its inputs are already there
and comparator B to finish its work by passing the input 1 on its upper line to the lower output line. Still, this saving in the
computation time has no asymptotic effect as we have shown in [5].
Fact 1. For uniformly distributed input bits the average delay of a sorting circuit over an arbitrary finite basis with gates of
bounded fanin that sorts n bits is at least Ω(log n).
1.2. Fast sorting circuits: From word comparators to Boolean gates
Thus, sorting n elements requires logarithmic time, even on the average. The complexity of sorting seems to be settled.
But what happens if we do not have to sort single bits, but long binary numbers? Obviously, the depth has to increase since
a binary circuit of bounded fanin cannot compare two long numbers in constant time.
Let n denote the number of elements that have to be sorted and let us start with a word comparator circuit Cn. A comparator
has indegree and outdegree 2, and takes two elements of the sorting domain and outputs the minimum at the top and the
maximum at the bottom output node. Let depth(Cn) denote the depth of the circuit where each comparator is assumed to
have depth 1 (see Fig. 1).
If the elements to be compared are binary numbers of some length m we call this the (n,m)-sorting problem. Now let
us consider the physical realization of comparators. A comparator CMm that compares two m-bit numbers has to be built
on the bit level. Such a subcircuit we will call a comparator module. There are two obvious alternatives of how to design a
comparator module and combine it with the topology of a word comparator circuit.
Fig. 2. The sorting circuit C3,4 with comparator modules based on a tree structure derived from a word comparator circuit with the same topology as C3
(Fig. 1). The numbers on the edges denote the bit positions. The dark circles and squares represent special gates, resp. subcircuits (their specification will
be given in Section 3).
Fig. 3. The sorting circuit C′3,4 with comparator modules implemented by linear arrays. The tilted sequence of bit positions on the output wires of a
comparator module indicates that, with respect to depth, high order bits can be output earlier than lower order ones.
On the one hand, one can compare two numbers x, y bitwise. Every bit comparison generates a result <, = or >, and
these results can be combined in binary tree-like fashion to determine the result ρ, which tells which number is the smaller
or whether the numbers are equal. Each pair of bits of x, y is then routed to the appropriate output position by a switch that
is driven by ρ. Assume that the combination of two bit comparison results can be performed by a subcircuit of depth δ.
Then such a comparator CMm can be implemented by a binary circuit of depth δ log m+O(1). This assumes no bound on the
fanout of a Boolean gate (if one insists on fanout at most 2 the depth becomes (δ + 1) log m+ O(1)). Thus, in total we get a
bit level sorting circuit Cn,m (see Fig. 2) of depth
(δ · log m+ O(1)) · depth(Cn).
Alternatively, one could compare the bits of two numbers in a linear fashion starting with the leading bits. This requires depth
linear in the number of bits, but has the advantage that after δ steps the leading bits of the two results of that comparator
are already known. This pipelined construction increases the depth of the sorting circuit Cn only by an additive term δ · m
resulting in a bit level sorting circuit C ′n,m (see Fig. 3) of depth at most
δ · (m+ depth(Cn)− 1).
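The trade-off between the two depth bounds above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original analysis: the function names are illustrative, and the parameter c is a hypothetical stand-in for the unspecified O(1) term of the tree-like module.

```python
import math


def tree_design_depth(m, word_depth, delta=1, c=2):
    """Depth (delta * log m + O(1)) * depth(C_n) of the tree-based design.

    c is an assumed placeholder for the O(1) term; log m is rounded up.
    """
    return (delta * math.ceil(math.log2(m)) + c) * word_depth


def pipelined_design_depth(m, word_depth, delta=1):
    """Depth bound delta * (m + depth(C_n) - 1) of the pipelined design."""
    return delta * (m + word_depth - 1)


if __name__ == "__main__":
    word_depth = 10  # hypothetical value of depth(C_n)
    for m in (4, 16, 64, 256, 1024):
        print(m, tree_design_depth(m, word_depth),
              pipelined_design_depth(m, word_depth))
```

Under these assumed constants the pipelined design is shallower for short numbers, while the tree-based design wins once m is large compared to depth(Cn).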
A detailed discussion of sorting in the bit model can be found in Section 1.1.2 of [10]. Several papers have considered worst-
case delay of Boolean sorting networks explicitly. In [2] Al-Hajery and Batcher constructed bit serial bitonic sorting networks
(BBSN) of size O(n log n) that sort n numbers each of length m = O(log n) in O(log^2 n) steps. BBSN is a periodic network of
depth O(log n) and size O(n log n) based on pipelining. The model of [2] differs, however, from the word comparator circuit
in that BBSN is a network of bit processors which has the same topology as bitonic sorting network and which processes
m-bit input strings in a bit serial fashion.
In [11] Leighton using methods due to Thompson [13] proved an Ω(n + m) lower time bound for (n,m)-sorting on a
(m × n)-array of bit processors. In addition, several papers have discussed VLSI architectures for sorting (e.g. [14,12,3,9]).
Using the topology of the asymptotically optimal AKS-network [1] of logarithmic depth and noticing that δ = O(1),
the pipelined construction by linear arrays described above sorts n numbers of length m ≤ O(log n) in O(log n) depth. If
m is asymptotically larger than n loglog n, tree-like comparator modules are better suited and give depth O(log n log m).
Because of huge constants, AKS-networks are only advantageous for very large n.
In [11] Leighton and Plaxton constructed butterfly-based sorting networks that sort correctly with probability close to 1.
These networks have depth 7.45 log n. An implementation of this topology with binary gates yields a randomized sorting
circuit of O(log n log m) depth with small constant factors and low error probability if tree-like bit level sorting modules
are used (compare Fig. 2) or O(m+ log n) depth for modules implemented by linear arrays (see Fig. 3).
1.3. Hybrid designs for efficient comparator modules
In this paper, we investigate the (n,m)-sorting problem for m significantly larger than log n. New comparator modules
that are hybrid versions of the two basic topologies will be constructed that significantly speed up sorting networks on
average. Our main result for uniformly distributed inputs is the following:
Theorem 2. For every m, there are comparator modules CMm with the following property: Let a Boolean circuit Cn,m for the
(n,m)-sorting problem be derived from a word comparator circuit Cn by implementing its comparators as CMm modules. Then for
uniformly distributed inputs,
delay(Cn,m) ≤ O(depth(Cn)) with probability at least 1− 1/n.
Even in the worst case, the delay does not exceed O(depth(Cn) · log (n + m)). This implies that the average time complexity of
the bit-comparator circuit is at most a constant factor larger than the worst-case time complexity of the word comparator circuit
independent of the length of the binary numbers:
avg-time(Cn,m) ≤ O(depth(Cn)).
To prove this result, we will first introduce the new comparator circuits called line tree comparator modules and then analyse
their timing behaviour.
These comparator modules use gates of unbounded fanout to spread information about comparison results fast. If one
imposes a constant fanout restriction the design has to be changed slightly and the delay bound increases to O(depth(Cn)+
log m). Still, the average delay remains on the same order as the word comparator circuit as long as the binary length of the
numbers is polynomially bounded.
Even in cases of strictly bounded fanout, for large numbers m we achieve the best combination of the simple architectures
described above concerning the average delay: only a logarithmic increase of size log m instead of m, and this only by an
additive term rather than a multiplicative factor.
1.4. Sorting fast on average — Even for nonuniform distributions
These methods can also be applied in cases of nonuniformly distributed inputs. However, we cannot allow arbitrary
distributions since this would end up in a worst-case situation. Thus, we have to put a bound on the complexity of the
distributions themselves. Improved average delay bounds can be obtained for such families of nonuniform distributions of
low complexity. The technical details get more involved and require even more sophisticated designs for the comparator
modules, in this case called forest comparator modules. We postpone the exact statement of this result to Section 5,
Theorem 17, but would like to stress at this point that these comparator modules do not use any knowledge about the
actual distribution, the design works uniformly for all complexity bounded distributions.
Organization of this paper
After this introduction and overview on the main results, in the next section we will briefly repeat the definition of the
asynchronous Boolean circuit model, the timing of gates and the complexity measure delay. The circuit design for the new
comparator modules will be described in Section 3. In Section 4 we construct specific bit comparator circuits for sorting
and merging and analyse their average delay for the uniform distribution. The extension to nonuniform distributions will
be described in Section 5.
2. Timing of a Boolean circuit
Let C be a Boolean circuit, and Vin, Vout denote its input, resp., output gates. For a gate g and input x of C let resg(x) denote
the value that is generated by g on x. If g is the i-th input gate then resg(x) = xi. Otherwise, resg(x) is determined by the
values resgi(x) of its immediate predecessors gi and the type of g .
If we want to exploit possibilities to speed up the computation of a Boolean circuit the timing of gates has to be done in a
more sophisticated fashion. For this purpose, one has to extend the binary logic to indicate when a Boolean value is available.
How this can be done efficiently has been explained in [5]. To concentrate on the topological aspects of sorting circuits
here, we simply assume that each gate knows from its predecessors when their values are available.
Circuits that work in an asynchronous mode may not get all their input bits at the same time. To model and analyze this
we introduce the notion of starting-line. Similar to [8] we define the following functions to measure the timing.
Fig. 4. S-gate and R-gate.
Definition 3. A starting-line for C is a function S : Vin → N.
• Given a starting-line S for C, the function $\mathrm{time}^C_S$ for pairs (g, x), where g is a gate of C and x an input, is given as follows:
\[
\mathrm{time}^C_S(g, x) :=
\begin{cases}
S(g) & \text{if } g \text{ is an input gate},\\
0 & \text{if } g \text{ is a constant gate},\\
1 + t_g(x) & \text{else},
\end{cases}
\]
where $t_g(x)$ denotes the smallest time t such that the values $\mathrm{res}_{g_i}(x)$ of those immediate predecessors $g_i$ of g with
$\mathrm{time}^C_S(g_i, x) \le t$ uniquely determine $\mathrm{res}_g(x)$.
• For the circuit C itself we define the timing by
\[
\mathrm{time}^C_S(x) := \max_{g \in V_{\mathrm{out}}} \mathrm{time}^C_S(g, x).
\]
• Let $\mathrm{time}^C(x)$ denote the timing if the starting-line S is identically 0.
• Given a probability distribution µ on the input space, we define the average delay of C by
\[
\mathrm{Etime}_\mu(C) := \sum_x \mu(x) \cdot \mathrm{time}^C(x).
\]
Thus, $\mathrm{time}^C_S(g, x)$ denotes the earliest moment when g knows its value assuming that the inputs are available according to
the starting-line S.
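The timing function of Definition 3 can be evaluated mechanically for small circuits. The following is a minimal sketch under stated assumptions: a bit comparator is modelled as an AND gate (upper output, minimum) and an OR gate (lower output, maximum), and the wiring of the 3-bit sorter of Fig. 1 (A on lines 1, 2; B on lines 2, 3; C on lines 1, 2) is an assumption read off the textual description, not a specification from the figure. All function names are illustrative.

```python
from itertools import product


def and_gate(vals):
    return int(all(vals))


def or_gate(vals):
    return int(any(vals))


def gate_time(func, pred_vals, pred_times):
    """Return 1 + t_g(x): one step after the earliest time t at which the
    predecessor values already available force the gate's result."""
    for t in sorted(set(pred_times)):
        unknown = [i for i, pt in enumerate(pred_times) if pt > t]
        outcomes = set()
        # Try every completion of the values that have not arrived yet.
        for fill in product((0, 1), repeat=len(unknown)):
            vals = list(pred_vals)
            for i, bit in zip(unknown, fill):
                vals[i] = bit
            outcomes.add(func(vals))
        if len(outcomes) == 1:
            return 1 + t
    raise ValueError("gate without predecessors")


# Assumed wiring of the 3-bit sorter, listed in topological order.
C3 = [
    ("A_min", and_gate, ("x", "y")),
    ("A_max", or_gate, ("x", "y")),
    ("B_min", and_gate, ("A_max", "z")),
    ("B_max", or_gate, ("A_max", "z")),
    ("C_min", and_gate, ("A_min", "B_min")),
    ("C_max", or_gate, ("A_min", "B_min")),
]
C3_OUTPUTS = ("C_min", "C_max", "B_max")


def run(circuit, outputs, inputs, start=None):
    """Return (output values, timing of the circuit) for one input."""
    start = start or {}
    val = dict(inputs)
    time = {name: start.get(name, 0) for name in inputs}
    for name, func, preds in circuit:
        pred_vals = [val[p] for p in preds]
        val[name] = func(pred_vals)
        time[name] = gate_time(func, pred_vals, [time[p] for p in preds])
    return [val[o] for o in outputs], max(time[o] for o in outputs)


def average_delay(circuit, outputs, names):
    """Average delay under the uniform distribution on the inputs."""
    total = 0
    for bits in product((0, 1), repeat=len(names)):
        total += run(circuit, outputs, dict(zip(names, bits)))[1]
    return total / 2 ** len(names)
```

With this assumed wiring, `run(C3, C3_OUTPUTS, {"x": 1, "y": 0, "z": 0})` yields the sorted output with timing 2, one step less than the depth 3, which matches the behaviour described for the input vector (1, 0, 0) in the Introduction.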
Normally, all input bits are available at the beginning of a computation, that is at time 0. In the following analysis, we
also have to consider the case when some bits are delayed. For this purpose, for k ∈ [1 . . . m] let us define the function
σk : [1 . . . m] → N by
\[
\sigma_k(j) :=
\begin{cases}
j - 1 & \text{if } j \le k,\\
k + 1 & \text{else}.
\end{cases}
\]
3. Average case efficient comparator modules
We start with a description of the basic building blocks of the comparator modules. Let B = {0, 1} denote the binary
alphabet and Σρ := {LE, EQ, GT} an alphabet to specify the result of a comparison of two elements, numbers or bits: less,
equal, or greater. Σρ will suitably be coded over B, for example by the 3 vectors (1, 0, 0), (0, 1, 0), (0, 0, 1). In the following
x, y, u, v will always denote variables that hold a binary value and ρ, ρ1, ρ′, . . . are variables that take values from Σρ. The
Boolean sorting circuits will be constructed from 2 basic types of gates: sorting gates, called S-gates for short, and gates
computing the results of comparisons, called R-gates for short (see Fig. 4).
• S-gate: it takes 3 inputs ρ, x, y and generates the 3 outputs u, v, ρ′. The input–output relation is defined as follows:
for ρ = EQ and x < y: u = min{x, y} = x, v = max{x, y} = y, ρ′ = LE,
for ρ = EQ and x > y: u = min{x, y} = y, v = max{x, y} = x, ρ′ = GT,
for ρ = EQ and x = y: u = min{x, y}, v = max{x, y}, ρ′ = EQ,
for ρ = LE: u = x, v = y, ρ′ = LE,
for ρ = GT: u = y, v = x, ρ′ = GT.
• R-gate: the inputs are ρ, ρ1, ρ2, the only output is ρ′:
for ρ ≠ EQ: ρ′ = ρ,
for ρ = EQ and ρ1 ≠ EQ: ρ′ = ρ1,
else ρ′ = ρ2.
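The two input–output relations above translate directly into code. The following is a minimal functional sketch; representing the alphabet Σρ by the strings "LE", "EQ", "GT" and the function names `s_gate` and `r_gate` are choices made here for illustration, and timing is not modelled.

```python
LE, EQ, GT = "LE", "EQ", "GT"


def s_gate(rho, x, y):
    """S-gate: return (u, v, rho') for compare info rho and bits x, y."""
    if rho == EQ:
        if x < y:
            return x, y, LE
        if x > y:
            return y, x, GT
        return x, y, EQ  # x == y, so min and max coincide
    if rho == LE:
        return x, y, LE  # order already decided: pass straight through
    return y, x, GT      # rho == GT: exchange the two bits


def r_gate(rho, rho1, rho2):
    """R-gate: the first input different from EQ wins, else rho2."""
    if rho != EQ:
        return rho
    if rho1 != EQ:
        return rho1
    return rho2
```

For example, `s_gate(GT, 0, 1)` exchanges the bits and returns `(1, 0, GT)`, since the comparison was already decided by higher-order bits.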
According to the Boolean basis and the coding of Σρ both types of gates can be realized by small subcircuits of some
fixed depth at most δ. For simplicity, through the rest of the paper we will assume that δ = 1, otherwise one has to include
this as a constant factor in all the circuit bounds stated below. For an S-gate it is important to note that depending on its
input values, its 3 outputs may be available at different times. Thus, the timing information should rather be attached to
the output wires of a gate than to the gate itself. However, since this will not be important in the following, we stick to the
simpler model as specified by Definition 3.
Fig. 5. A line comparator module.
Our circuit designs will also make use of simplified versions of an S-gate, which will be used to construct lower and upper
levels of comparator modules. An L-gate, used for lower levels, is an S-gate where the ρ′-output is not needed. A U-gate,
applied to construct upper levels, does not need the ρ-input and behaves as if this input were EQ. Furthermore, a U-gate
does not have the outputs u and v, and thus its only output is ρ′. Also, some R-gates will lack the input ρ,
working in such cases as if this input were EQ.
A comparator module CMm is a subcircuit built from S- and R-gates that takes two binary numbers x = x1 . . . xm and
y = y1 . . . ym and produces two output strings u and v such that u = min{x, y} and v = max{x, y}. We assume that x1,
resp. y1 are the leading bits of the binary numbers. In addition, CMm outputs the result ρ ∈ Σρ of the comparison, that is
either LE, EQ or GT. In the following, ρ will be called the compare info of CMm. First we will give a formal description of the
line- and tree-comparator modules considered in the Introduction.
A line comparator module LCMm is a comparator module consisting of a linear array of S-gates S1, . . . , Sm where each Si
gets the i-th bit of x and y and the compare result ρi−1 of Si−1 (see Fig. 5). For S1 we define ρ0 := EQ. Si outputs the two
bits ui and vi and ρi as the result of comparing the prefixes x1, . . . , xi and y1, . . . , yi. The compare info of CMm is ρm, i.e. the
compare result of the last Sm.
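The linear chain of S-gates can be sketched functionally as follows. This is an illustrative sketch only: it models the values computed by the module, not its timing, and the names `line_comparator` and `to_bits` are assumptions introduced here.

```python
LE, EQ, GT = "LE", "EQ", "GT"


def s_gate(rho, x, y):
    """S-gate as specified above: return (u, v, rho')."""
    if rho == EQ:
        if x < y:
            return x, y, LE
        if x > y:
            return y, x, GT
        return x, y, EQ
    return (x, y, LE) if rho == LE else (y, x, GT)


def line_comparator(x_bits, y_bits):
    """Chain S_1, ..., S_m; bit lists start with the leading bit."""
    rho = EQ  # rho_0 := EQ
    u, v = [], []
    for x, y in zip(x_bits, y_bits):
        ui, vi, rho = s_gate(rho, x, y)  # rho_{i-1} feeds S_i
        u.append(ui)
        v.append(vi)
    return u, v, rho  # rho is the compare info rho_m


def to_bits(value, m):
    """Helper: m-bit binary representation, leading bit first."""
    return [(value >> (m - 1 - i)) & 1 for i in range(m)]
```

For instance, `line_comparator(to_bits(6, 4), to_bits(5, 4))` returns the bits of 5 as u, the bits of 6 as v, and the compare info GT.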
Even though some of the pairs ui, vi may be computed faster, namely if xi = yi, gate Si always has to wait for the output
ρi−1 of its left neighbour in order to determine its output ρi. Since the ρi form a linear chain the following timing bounds
apply to line comparators.
Lemma 4. Let x, y be an arbitrary input pair for LCMm with starting-time S(xi), S(yi) ≤ i − 1 for all i ∈ [1 . . . m]. Then
\[
\mathrm{time}^{LCM_m}_S(S_j, (x, y)) = j \quad \text{for every } j \in [1 \ldots m].
\]
Proof. Although this claim is easy to verify we give a proof to introduce the method that will be repeated in a more
complicated situation later. According to Definition 3, for each j ∈ [1 . . .m] it holds
timeLCMmS (Sj, (x, y)) = 1+ tSj(x, y),
where tSj(x, y) denotes the smallest time t , such that a subset of the input values xj, yj, and ρj−1 the member of which
satisfy the conditions S(xj), S(yj) ≤ t and timeLCMmS (Sj−1, (x, y)) ≤ t uniquely determines the outputs of Sj, where
timeLCMmS (Sj−1, (x, y)) gives the time when the output ρj−1 of gate Sj−1 is available. An S-gate needs all input values in order
to compute its 3 outputs. Thus, tSj(x, y) is just the maximum among the values S(xj), S(yj), and time
LCMm
S (Sj−1, (x, y)).
Now, we show by induction that timeLCMmS (Sj, (x, y)) = j. For j = 1 we get timeLCMmS (Sj, (x, y)) = 1 since S(x1), S(y1)
have to equal 0 and the input ρ0 = EQ is a constant value that is available at time t = 0, too. For j > 1, by assumption it
holds S(xj), S(yj) ≤ j − 1, and by induction hypothesis timeLCMmS (Sj−1, (x, y)) = j − 1. Thus, the observation above yields
timeLCMmS (Sj, (x, y)) = j. 
A tree comparator module TCMm makes all the comparisons of input pairs xi, yi in parallel by a sequence of U-gates
U1, . . . ,Um, and then combines their results ρ1, . . . , ρm by a binary tree of R-gates (restricted to 2-input gates) to obtain the
compare info ρ (see Fig. 6). The root of this tree will be denoted by R̂.
The compare info ρ at R̂ is then used to drive m L-gates L1, . . . , Lm. Li either passes the two inputs xi, yi straight through if ρ
equals LE or EQ, or exchanges their order otherwise.
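The three layers of the tree comparator module (U-gates, a binary tree of 2-input R-gates, and L-gates driven by the root) can be sketched functionally as follows. This is a hedged illustration: the function names are assumptions, an unpaired node on a tree level is simply passed upwards, and only the tree depth is reported, not the full timing.

```python
LE, EQ, GT = "LE", "EQ", "GT"


def u_gate(x, y):
    """U-gate: compare one pair of bits, no rho-input and no u, v outputs."""
    return LE if x < y else GT if x > y else EQ


def r2_gate(rho1, rho2):
    """2-input R-gate: the result for the higher-order part wins unless EQ."""
    return rho1 if rho1 != EQ else rho2


def l_gate(rho, x, y):
    """L-gate: exchange the bits only if the compare info is GT."""
    return (y, x) if rho == GT else (x, y)


def tree_comparator(x_bits, y_bits):
    """Return (u, v, rho, tree_depth); bit lists start with the leading bit."""
    level = [u_gate(x, y) for x, y in zip(x_bits, y_bits)]
    depth = 0
    while len(level) > 1:
        nxt = [r2_gate(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired node moves up unchanged
        level = nxt
        depth += 1
    rho = level[0]  # compare info at the root
    pairs = [l_gate(rho, x, y) for x, y in zip(x_bits, y_bits)]
    return [p[0] for p in pairs], [p[1] for p in pairs], rho, depth


def to_bits(value, m):
    """Helper: m-bit binary representation, leading bit first."""
    return [(value >> (m - 1 - i)) & 1 for i in range(m)]
```

The R-gate tree in this sketch has depth ⌈log2 m⌉, in line with the logarithmic term in the timing bound for tree comparator modules.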
Fig. 6. A tree comparator module.
Value ρ can either be forwarded to the Li directly if we allow unbounded fanout or we have to use another binary tree
to duplicate this information if the fanout is bounded. In the following we will consider the case of unbounded fanout,
otherwise in the timing bounds below one has to include another additive term log m.
In the following estimations, log m := ⌈log2 m⌉ will denote the binary logarithm rounded up. It is straightforward to see
that for the circuit TCMm, for all inputs (x, y), arbitrary S, and all j ∈ [1 . . . m] it holds
\[
\mathrm{time}^{TCM_m}_S(L_j, (x, y)) \le 2 + \log m + \max\{ S(x_i), S(y_i) \mid i \in [1 \ldots m] \}.
\]
This follows easily from the fact that the timing of any circuit does not exceed the latest moment when all input values are
known plus the depth of the circuit. The reader may notice that for the gate L1 one can actually prove
\[
\mathrm{time}^{TCM_m}_S(L_1, (x, y)) = \max\{ S(x_1), S(y_1) \},
\]
but this better bound will not be necessary for our further analysis.
To be more efficient in the average case we now define hybrid versions of these two architectures. They will depend on
an additional parameter k ∈ N. When applied to the sorting problem of n elements, k will typically be of order log n.
Definition 5. A k-line tree comparator module, LTCMm,k for short, consists of a line comparator module LCMk for the prefixes
x1, . . . , xk and y1, . . . , yk, and a tree comparator module TCMm−k for the suffixes xk+1, . . . , xm and yk+1, . . . , ym with the
following modification (see Fig. 7). The root R̂ of the tree comparator additionally gets the compare info ρk of LCMk, and if
this result is EQ then it works as previously. Otherwise, R̂ outputs this value as a result of the comparison between x and y
and propagates this value to the Li.
A combination of the two timing bounds for LCM and TCM modules gives
Lemma 6. For every input pair (x, y) and an arbitrary starting-line S fulfilling S(xi), S(yi) ≤ σk(i), it holds
\[
\mathrm{time}^{LTCM_{m,k}}_S(S_j, (x, y)) = j \quad \text{for } j \le k,
\]
\[
\mathrm{time}^{LTCM_{m,k}}_S(L_j, (x, y)) \le
\begin{cases}
2 + \log(m - k) + k & \text{for } j > k \text{ and } \rho_k = \mathrm{EQ},\\
2 + k & \text{for } j > k \text{ and } \rho_k \ne \mathrm{EQ}.
\end{cases}
\]
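The bounds of Lemma 6 are easy to tabulate. The following minimal sketch evaluates the starting-line function σk and the right-hand sides of the lemma; the function names are illustrative, log is the binary logarithm rounded up as fixed above, and m > k is assumed.

```python
import math


def sigma(k, j):
    """sigma_k(j): j - 1 for the first k bit positions, k + 1 afterwards."""
    return j - 1 if j <= k else k + 1


def ltcm_time_bound(j, m, k, prefix_equal):
    """Right-hand side of Lemma 6 for output position j of LTCM_{m,k}.

    prefix_equal states whether rho_k = EQ, i.e. whether the two inputs
    agree on their first k bits. Assumes m > k.
    """
    if j <= k:
        return j  # line part: exactly j steps
    if prefix_equal:
        return 2 + math.ceil(math.log2(m - k)) + k  # tree must decide
    return 2 + k  # line part already decided the comparison
```

For m = 64 and k = 6, the bound for a suffix position is 8 when the prefixes differ and 14 when they agree, which illustrates why inputs with distinct length-k prefixes are the favourable case.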
To achieve small average delay for nonuniform distributions we need an even more involved combination of trees and
pipelining.
Definition 7. A (k, λ)-forest comparator module FCMm,k,λ consists of k tree comparator modules T1, . . . , Tk of type TCMλ and a
tree comparator Tk+1 of type TCMm′ for m′ := m − kλ (see Fig. 8). In addition, the roots Ri of the Ti form a linear array such
that Ri gets the compare info of Ri−1 as another input.
Modules of this kind will be useful only for small values of λ, typically less than loglog n.
Similar to the analysis above, one can show the following timing bounds.
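Lemma 8. For all starting-lines S with S(xi), S(yi) ≤ σk(⌈i/λ⌉) and all input pairs (x, y) it holds
\[
\mathrm{time}^{FCM_{m,k,\lambda}}_S(L_j, (x, y)) \le
\begin{cases}
\log \lambda + \lceil i/\lambda \rceil - 1 & \text{for } j \le k,\\
\log \lambda + k + \log(m - k\lambda) & \text{else}.
\end{cases}
\]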
Fig. 7. A k-line tree comparator module for k = 3.
Fig. 8. A (k, λ) forest comparator module for k = 3 and λ = 4; the last tree Tk+1 taking the remaining m − 12 input pairs is not drawn.
4. Average case delay for the uniform distribution
4.1. Conflicts and congestions
The previous section has shown that the delay of a comparator module depends on the length of the prefix up to which
its two inputs x and y are identical. Therefore, we make the following definition.
Definition 9. Let $X = X^1, \ldots, X^n$ be a sequence of strings with $X^i = x^i_1 \ldots x^i_m \in \{0, 1\}^m$, $k \in \mathbb{N}$, and $w \in \{0, 1\}^k$. We call w
a conflict prefix of X if X contains at least two strings $X^i$, $X^j$ ($i \ne j$) with prefix w. Let
confk(X) = number of different conflict prefixes in X of length k.
A k-congestion of X is a subsequence X ′ of X such that all its members have identical prefixes of length k. Here, a subsequence
is allowed to be noncontiguous in the original sequence. We define
congk(X) = maximal size of a k-congestion of X ,
where size counts the number of elements of the subsequence X ′.
Obviously, the values congk(X) are monotonically nonincreasing in k. Furthermore, congk(X) = 1 means that all strings in
X have different prefixes of length k, thus there is no real congestion. A real congestion has to be of size at least 2.
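Both quantities of Definition 9 depend only on how often each prefix of length k occurs. The following minimal sketch computes them for strings given as Python `str` objects over the characters 0 and 1; the function names mirror the notation of the definition and are otherwise illustrative.

```python
from collections import Counter


def prefix_counts(strings, k):
    """Count how often each prefix of length k occurs in the sequence."""
    return Counter(s[:k] for s in strings)


def conf(strings, k):
    """conf_k(X): number of different prefixes shared by at least two strings."""
    return sum(1 for c in prefix_counts(strings, k).values() if c >= 2)


def cong(strings, k):
    """cong_k(X): maximal number of strings with an identical k-prefix."""
    return max(prefix_counts(strings, k).values())


if __name__ == "__main__":
    X = ["0010", "0011", "0100", "1100"]
    for k in range(1, 5):
        print(k, conf(X, k), cong(X, k))
```

For the sample sequence in the usage part, the only conflict prefix of length 2 is 00, so conf is 1 and cong is 2, while for k = 4 all strings differ and cong drops to 1.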
In this section we assume that $X^{n,m}$ is a uniformly distributed random variable generating an independent sequence
$X^1, \ldots, X^n$ of binary numbers of length m each. We can upper bound conflicts and congestion as follows.
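Lemma 10. For every k, β, γ ∈ N it holds:
\[
\Pr[\mathrm{conf}_k(X^{n,m}) \ge \beta] \le 2^{-\beta (k - 2 \log n)},
\]
\[
\Pr[\mathrm{cong}_k(X^{n,m}) \ge \gamma] \le 2^{-(\gamma - 1)(k - \log n) + \log n}.
\]
Proof. The probability for a large number of conflict prefixes can be estimated as follows:
\[
\Pr[\mathrm{conf}_k(X^{n,m}) \ge \beta] \le \frac{n (n - 1) \cdots (n - 2\beta + 1)}{\beta!} \, (2^{-k})^{\beta} \le 2^{2\beta \log n - k\beta}.
\]
The probability for a large congestion is bounded by
\[
\Pr[\mathrm{cong}_k(X^{n,m}) \ge \gamma] \le 2^k \binom{n}{\gamma} \left( \frac{1}{2^k} \right)^{\gamma} \le 2^{k + \gamma \log n - \gamma k}.
\]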
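The bounds of Lemma 10 can be checked numerically for small parameters. The following is a minimal sketch, not part of the proof: it evaluates both bounds with the exact binary logarithm and estimates Pr[cong_k ≥ 2] by sampling uniformly distributed m-bit numbers. The function names, the sample sizes and the fixed seed are assumptions made for illustration, and m ≥ k is required.

```python
import math
import random


def conf_bound(n, k, beta):
    """Upper bound 2^(-beta (k - 2 log n)) on Pr[conf_k >= beta]."""
    return 2.0 ** (-beta * (k - 2 * math.log2(n)))


def cong_bound(n, k, gamma):
    """Upper bound 2^(-(gamma - 1)(k - log n) + log n) on Pr[cong_k >= gamma]."""
    return 2.0 ** (-(gamma - 1) * (k - math.log2(n)) + math.log2(n))


def estimate_real_congestion(n, m, k, trials, seed=0):
    """Monte Carlo estimate of Pr[cong_k >= 2] for n uniform m-bit numbers."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Keep only the k leading bits of each random m-bit number.
        prefixes = [rng.getrandbits(m) >> (m - k) for _ in range(n)]
        if len(set(prefixes)) < n:  # two numbers share a k-prefix
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, m = 16, 32
    k = 3 * int(math.log2(n))
    print(cong_bound(n, k, 2), estimate_real_congestion(n, m, k, 2000))
```

For γ = 2 and k = 3 log n the exponent of the congestion bound simplifies to −log n, so the bound evaluates to 1/n.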
4.2. Line tree modules as universal comparators
We will show that k-line tree comparators are efficient building blocks for sorting circuits if the k-congestion of the input
strings is small. From the lemma above it follows that for k ≥ 3 log n a real k-congestion does not occur with high probability:
\[
\Pr[\mathrm{cong}_{3 \log n}(X^{n,m}) \ge 2] \le 1/n.
\]
On the other hand, for k ≤ (1 − ε) log n with ε > 0, the k-congestion will typically be quite large. For the rest of this section
we will fix
k := 3 log n.
It suffices to consider the case that the length m of the numbers to be sorted is large compared to n, that means m ≥ k,
because otherwise there cannot be any k-congestion by definition and the claims following hold trivially.
Let $X^i$, $X^j$ be inputs of an LTCMm,k. If the prefixes of length k of $X^i$, $X^j$ are different then the module can obtain the compare
info in O(k) steps. In this case, we say that the module gets the result fast.
Using the function σk introduced at the end of Section 2, we define a starting line Sk for a sorting circuit C as follows.
For i ∈ [1 . . . n] and j ∈ [1 . . .m] let xi,j denote the gate that gets the j-th input bit of the i-th number X i, and yi,j the
corresponding output gate. Then Sk(xi,j) = σk(j). Note that an input sequence X = X1, . . . , Xn with congk(X) = 1 can be
sorted by comparing the prefixes of length k and exchanging the remaining parts of the strings according to the compare info
of these prefixes.
Let C be a bit comparator circuit for the (n,m)-sorting problem that is obtained from a word comparator circuit Cn
by implementing its comparators as LTCMm,k modules. For a comparator g of Cn, let g1, . . . , gm denote the corresponding
sequence of output gates of the comparator module in C that replaces g . Moreover, let depthCn(g) denote the depth of the
module g in the circuit Cn.
Lemma 11. For every input sequence X, with congk(X) = 1, and every gj it holds
\[
\mathrm{time}^C_{S_k}(g_j, X) \le \sigma_k(j) + \mathrm{depth}_{C_n}(g).
\]
One can easily establish this bound by induction on the depth of a comparator module in Cn.
Lemma 12. Let C be a circuit for the (n,m)-sorting problem that is obtained from an arbitrary word comparator circuit Cn by
implementing its comparators as LTCMm,k. Then with probability at least 1 − 1/n, for every output gate yi,j of C it holds
\[
\mathrm{time}^C_{S_k}(y_{i,j}, X^{n,m}) \le \sigma_k(j) + \mathrm{depth}(C_n).
\]
Proof. From Lemma 10 it follows for the chosen k that with probability at least 1 − 1/n, there is no real k-congestion. If we
restrict ourselves to input sequences X with congk(X) = 1, the bound follows from the property
\[
\mathrm{time}^C_{S_k}(g_j, X) \le \sigma_k(j) + \mathrm{depth}_{C_n}(g),
\]
where gj denotes the j-th output gate of a comparator module of C, g its corresponding comparator in Cn, and depthCn(g)
the depth of g in Cn.
The last lemma implies that all output gates can compute their values by time step
k+ depth(Cn)+ 1 ≤ 4 depth(Cn)+ 1,
using the fact that depth(Cn) ≥ log n. This proves Theorem 2.
If one requires that the fanout of our basic R-gates has to be bounded we have to add an out-tree to the root and obtain
a slightly larger delay bound of the form
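\[
\mathrm{time}^C_{S_k}(y_{i,j}, X^{n,m}) \le
\begin{cases}
\sigma_k(j) + \mathrm{depth}(C_n) & \text{if } j \le k,\\
\sigma_k(j) + \mathrm{depth}(C_n) + \log(m - k) & \text{else}.
\end{cases}
\]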
5. Average case delay for nonuniform distributions
This section will extend the previous results to nonuniform distributions. We have to bound the complexity of
distributions somehow, because otherwise the average case would equal the worst case. This is done within the circuit
model itself.
Definition 13. A distribution generating circuit is a Boolean circuit D of fanin and fanout at most 2. If D has r input gates and
n output gates it performs a transformation of a random variable Z uniformly distributed over $\{0, 1\}^r$ into a random variable
X over $\{0, 1\}^n$. The input vector for D is chosen according to Z, and the distribution of X is given by the distribution of the
values obtained at the output gates.
In the following we will identify a distribution over $\{0, 1\}^{n \cdot m}$ with a corresponding random vector variable X. Let
$X = (X^1, \ldots, X^n)$ with $X^i = X^i_1 \ldots X^i_m \in \{0, 1\}^m$.
Definition 14. Let $D_{n,m}$ denote the set of all probability distributions µ on $\{0, 1\}^{n \cdot m}$. For $\mu \in D_{n,m}$ let Supp(µ) be the set of
all vectors $X \in \{0, 1\}^{n \cdot m}$ with nonzero probability µ(X). We call a distribution in $D_{n,m}$ strictly positive if $\mathrm{Supp}(\mu) = \{0, 1\}^{n \cdot m}$
and let $D^+_{n,m}$ denote the set of such distributions. Finally define
\[
\begin{aligned}
\mathrm{Depth}_{n,m}(d) := \{\, \mu \in D^+_{n,m} \mid{} & \exists \text{ an } r\text{-input and } (n \cdot m)\text{-output Boolean circuit } D \text{ of depth } d\\
& \text{that transforms a uniformly distributed random variable } Z \text{ over } \{0, 1\}^r\\
& \text{into a random variable } X \text{ with distribution } \mu, \text{ where } r \text{ may be any integer} \,\}.
\end{aligned}
\]
By definition, Depthn,m(d) contains strictly positive probability distributions only. In our setting, where a single circuit
should have good average case behaviour for every distribution in this class, this is obviously necessary to exclude trivial
cases. Otherwise one could concentrate the probability mass on the worst-case inputs and average case complexity would
equal worst-case complexity. The same problem would arise if the distribution generating circuits were allowed to use gates of
unbounded fanin or fanout.
The depth bound on the distribution generating circuit implies the following property proved in [5]:
Fact 15. For $X \in \mathrm{Depth}_{n,m}(d)$ it holds:
• $\Pr[X^i_j = 1]$ is a multiple of $2^{-2^d}$ for every i, j;
• for every subset A of random variables $X^i_j$ there exists a subset $B \subset A$ of size at least $|A| \cdot 2^{-2d}$ such that the elements of B are
pairwise independent.
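The first item of Fact 15 can be checked by brute force on a small example: a gate at depth d with fanin 2 reads at most $2^d$ random inputs, so its probability of being 1 lies on a grid of width $2^{-2^d}$. The sketch below is illustrative only; the nested-tuple encoding of gates and the helper names are assumptions made here, not notation from the paper.

```python
from fractions import Fraction
from itertools import product

# A gate is either an input index (int) or a tuple (op, left, right)
# with op in {"and", "or", "xor"}. This encoding is an assumption of the sketch.

def depth(g):
    """Depth of the gate tree; input gates have depth 0."""
    return 0 if isinstance(g, int) else 1 + max(depth(g[1]), depth(g[2]))

def inputs(g):
    """Set of input indices the gate depends on."""
    return {g} if isinstance(g, int) else inputs(g[1]) | inputs(g[2])

def evaluate(g, z):
    """Value of gate g under the input assignment z (a dict index -> bit)."""
    if isinstance(g, int):
        return z[g]
    a, b = evaluate(g[1], z), evaluate(g[2], z)
    return {"and": a & b, "or": a | b, "xor": a ^ b}[g[0]]

def prob_one(g):
    """Exact Pr[g = 1] under uniform inputs, enumerating only the inputs g reads."""
    idx = sorted(inputs(g))
    hits = sum(evaluate(g, dict(zip(idx, bits)))
               for bits in product((0, 1), repeat=len(idx)))
    return Fraction(hits, 2 ** len(idx))

gate = ("or", ("and", 0, 1), ("and", 2, 3))   # depth 2, reads 2^2 = 4 inputs
p = prob_one(gate)
grid = 2 ** (2 ** depth(gate))                 # 2^(2^d)
is_multiple = (p * grid).denominator == 1      # True iff p is a multiple of 2^(-2^d)
print(p, is_multiple)
```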
To guarantee small average delay the k-congestion has to be low as seen above. Below we establish a bound on the
congestion of a random variable generated by a circuit of small depth.
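To illustrate what a congestion bound has to control, the following sketch treats the simplest case of n independent, uniformly distributed prefixes. It assumes that cong_k(X) is the largest number of input strings sharing one prefix of length k (a paraphrase of the definition given earlier in the paper), and it compares the exact probability of a shared prefix with the union bound over all pairs. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cong(strings, k):
    """Largest number of strings sharing one prefix of length k (assumed definition)."""
    return max(Counter(s[:k] for s in strings).values())

def collision_probability(n, k):
    """Exact Pr[cong_k >= 2] for n independent uniform k-bit prefixes (brute force)."""
    words = ["".join(w) for w in product("01", repeat=k)]
    bad = sum(1 for xs in product(words, repeat=n) if cong(xs, k) >= 2)
    return Fraction(bad, len(words) ** n)

def union_bound(n, k):
    """n(n-1)/2 pairs, each agreeing on k uniform bits with probability 2^-k."""
    return Fraction(n * (n - 1), 2) * Fraction(1, 2 ** k)

exact = collision_probability(3, 2)
bound = union_bound(3, 2)
print(exact, bound)
```

For nonuniform distributions the per-pair probability is no longer $2^{-k}$, which is why a separate estimate is needed in that setting.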
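Lemma 16. For $d \ge 1$ and $X \in \mathrm{Depth}_{n,m}(d)$ it holds:
$$\Pr[\mathrm{conf}_k(X) \ge 1] \le \frac{1}{n} \quad\text{and}\quad \Pr[\mathrm{cong}_k(X) \ge 2] \le \frac{1}{n}$$
for $k \ge 3 \cdot 2^{2d+1+2^{d+1}} \log n$.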
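Proof. Let $X = X^1 \ldots X^n \in \mathrm{Depth}_{n,m}(d)$ and let $Z^i$ denote the prefix of $X^i$ of length k. Z denotes the sequence of all $Z^i$. Since
$\mathrm{cong}_k(X) = \mathrm{cong}_k(Z)$ we can bound the probability that $\mathrm{cong}_k(X) \ge 2$ as follows:
$$\begin{aligned}
\Pr[\mathrm{cong}_k(X) \ge 2] &\le \sum_{i_1 < i_2} \sum_{w \in \{0,1\}^k} \Pr[Z^{i_1} = w \wedge Z^{i_2} = w] \\
&\le \sum_{i_1 < i_2} \sum_{w \in \{0,1\}^k} \Pr[Z^{i_1} = w] \cdot \Pr[Z^{i_2} = w \mid Z^{i_1} = w].
\end{aligned}$$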
For
$$q := \max_{i_1, i_2, w} \Pr[Z^{i_2} = w \mid Z^{i_1} = w]$$
one gets
$$\Pr[\mathrm{cong}_k(X) \ge 2] \le q \cdot \sum_{i_1 < i_2} \sum_{w \in \{0,1\}^k} \Pr[Z^{i_1} = w] = q \cdot \frac{n \cdot (n-1)}{2}.$$
It remains to establish an upper bound on q.
Let $Z^i = Z^i_1 \ldots Z^i_k$ and consider a pair $(Z^{i_1}_u, Z^{i_2}_u)$ of corresponding bit positions. According to the construction of the
probability distribution of X, there are at most $2^d$ random input gates of the distribution generating circuit D of X that
may influence the value of $Z^{i_1}_u$, and therefore there are fewer than $2^{2d}$ variables $Z^{i_j}_\ell$ that are not independent of $Z^{i_1}_u$. The same
observation also holds for $Z^{i_2}_u$. Hence, there are at least $k/2^{2d+1}$ pairs $(Z^{i_1}_u, Z^{i_2}_u)$ that are pairwise independent. Let $K[X, i_1, i_2]$
denote a collection of $k/2^{2d+1}$ pairwise independent pairs $(Z^{i_1}_u, Z^{i_2}_u)$. For
$$p := \min_{i_1 \ne i_2 \in \{1,\ldots,n\},\ (Z^{i_1}_u, Z^{i_2}_u) \in K[X, i_1, i_2]} \{\Pr[Z^{i_2}_u = 0 \mid Z^{i_1}_u = 1],\ \Pr[Z^{i_2}_u = 1 \mid Z^{i_1}_u = 0]\}$$
we get
$$q \le (1-p)^{k/2^{2d+1}} \le \exp(-p \cdot k/2^{2d+1}),$$
where exp denotes the exponential function with base e. Note that the values of $Z^{i_1}_u$ and $Z^{i_2}_u$ are determined by at most
$\ell \le 2^{d+1}$ random inputs of D. Since we assume that D generates a strictly positive probability distribution, there has to be
at least one assignment to the $\ell$ input variables such that $Z^{i_2}_u = 0$ whereas $Z^{i_1}_u = 1$, and at least one assignment such that
$Z^{i_2}_u = 1$ whereas $Z^{i_1}_u = 0$. Each assignment occurs with probability $2^{-\ell}$. Hence, we have
$$p \ge 2^{-2^{d+1}}.$$
Summarizing we get
$$\Pr[\mathrm{cong}_k(X) \ge 2] \le \exp(-p \cdot k/2^{2d+1}) \cdot \frac{n \cdot (n-1)}{2} \le 2^{-2^{-2^{d+1}} \cdot k/2^{2d+1} + 2\log n}.$$
For $k \ge 3 \cdot 2^{2d+1+2^{d+1}} \log n$ this probability drops below 1/n.
Note that $\mathrm{cong}_k(X) < 2$ implies that $\mathrm{conf}_k(X) = 0$ for all sequences X. Hence, our bound for the k-congestion $\mathrm{cong}_k(X)$
implies the bound for the number of different conflict prefixes $\mathrm{conf}_k(X)$ stated in the lemma directly. □
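The threshold of Lemma 16 grows doubly exponentially in d, which is easy to see numerically. The sketch below evaluates the threshold and the final bound of the proof; it assumes that log denotes the logarithm to base 2, and the function names are introduced here for illustration only.

```python
import math

def lemma16_threshold(n, d):
    """Threshold of Lemma 16: 3 * 2^(2d + 1 + 2^(d+1)) * log2(n)."""
    return 3 * 2 ** (2 * d + 1 + 2 ** (d + 1)) * math.log2(n)

def congestion_bound(n, d, k):
    """Final bound of the proof: 2^(-2^(-2^(d+1)) * k / 2^(2d+1) + 2 * log2(n))."""
    exponent = -(2.0 ** -(2 ** (d + 1))) * k / 2 ** (2 * d + 1) + 2 * math.log2(n)
    return 2.0 ** exponent

n = 1024
for d in (1, 2, 3):
    k = lemma16_threshold(n, d)
    # At the threshold the bound equals 1/n; larger k makes it smaller.
    print(d, k, congestion_bound(n, d, k), congestion_bound(n, d, 2 * k))
```

Already for d = 3 and n = 1024 the threshold exceeds a million bit positions, which indicates why only very small depth bounds d are of interest here.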
For small d, i.e. $d = \log\log\log n$, the bound given in Lemma 16 implies $\Pr[\mathrm{cong}_k(X) \ge 2] \le 1/n$ for $k \in \Theta(\log^2 n \cdot \log\log n)$.
One should note that even with such a small depth bound d one can construct highly biased strings X (for example such
that $\Pr[X^i_j = 1] = 1/\log n$) that in addition have a lot of dependencies among their bits.
Theorem 17. Let $C_n$ be a word comparator sorting circuit and C a Boolean circuit for the (n,m)-sorting problem derived from
$C_n$ by replacing its comparators by forest comparator modules $FCM_{m,k,\lambda}$ with $k = \mathrm{depth}(C_n)$ and $\lambda \approx \log n \cdot \log\log^2 n$. Then for
$d < \log\log\log n$ and $X \in \mathrm{Depth}_{n,m}(d)$, with probability greater than $1 - 1/n$ it holds
$$\mathrm{time}_C(X) \le 3\,\mathrm{depth}(C_n)\,\log\log\log n.$$
Proof. The (k, λ)-forest modules $FCM_{m,k,\lambda}$ are used with parameters $k = \mathrm{depth}(C_n)$ and $\lambda = (3 \log^2 n \cdot \lceil \log\log^2 n \rceil)/k$. Let
$\ell := k \cdot \lambda$. Then by Lemma 16 it follows that with probability at least $1 - 1/n$ all binary input strings have different prefixes of
length $\ell$. If we assume that all input pairs $X^i, X^s$ differ within a prefix of length $\ell$ then we can improve Lemma 8 as follows:
For all starting-lines S with $S(X^i_j), S(X^s_j) \le \sigma_k(\lceil j/\lambda \rceil)$ it holds
$$\mathrm{time}^{FCM_{m,k,\lambda}}_S(L_j, (X^i, X^s)) \le
\begin{cases}
\log \lambda + \lceil j/\lambda \rceil - 1 & \text{for } j \le k,\\
\log \lambda + k + 1 & \text{else.}
\end{cases}$$
Moreover, by induction on the depth of a word-comparator circuit one can show that for every word-comparator circuit C
of depth h and for every Boolean circuit C′ that is derived from C by implementing its word-size comparators as (k, λ)-forest
modules $FCM_{m,k,\lambda}$ it holds
$$\mathrm{time}^{C'}_S(L_j, (X^i, X^s)) \le
\begin{cases}
h \cdot \log \lambda + \lceil j/\lambda \rceil - 1 & \text{for } j \le k,\\
h \cdot \log \lambda + k + 1 & \text{else.}
\end{cases}$$
Hence, with probability at least $1 - 1/n$ it holds
$$\mathrm{time}_C(X) \le k + 1 + \mathrm{depth}(C_n) \cdot \log \lambda.$$
The claim follows directly. □
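The choice of the prefix length $\ell = k \cdot \lambda$ in the proof can be checked against the threshold of Lemma 16 by a short calculation. The following is a sketch under two assumptions that are not stated explicitly in the text: logarithms are taken to base 2, and d is an integer with $d \le \log\log\log n - 1$, so that $2^{d+1} \le \log\log n$. Note that the prefix length called k in Lemma 16 corresponds to $\ell$ here.

```latex
\begin{align*}
2^{2^{d+1}} &\le 2^{\log\log n} = \log n, \\
2^{2d+1} &= \tfrac{1}{2}\,\bigl(2^{d+1}\bigr)^2 \le \tfrac{1}{2}\,(\log\log n)^2, \\
3 \cdot 2^{2d+1+2^{d+1}} \log n
  &= 3 \cdot 2^{2d+1} \cdot 2^{2^{d+1}} \cdot \log n \\
  &\le \tfrac{3}{2}\,\log^2 n \cdot (\log\log n)^2 \\
  &\le 3 \log^2 n \cdot \lceil \log\log^2 n \rceil = k \cdot \lambda = \ell .
\end{align*}
```

Under these assumptions $\ell$ is at least the threshold required by Lemma 16, so the lemma applies to prefixes of length $\ell$.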
That a tiny depth bound is indeed necessary can be seen as follows. If we choose $d \ge \log\log (n m)^{\varepsilon}$ for an arbitrary
constant $0 < \varepsilon \le 1$ one can construct $X \in \mathrm{Depth}_{n,m}(d)$ such that for n, m large enough
$$\Pr[\mathrm{cong}_{m^{\varepsilon/2}}(X) \ge n^{\varepsilon/2}] \ge \frac{1}{2}.$$
Moreover, X can be constructed such that with probability at least 1/2 the input strings $X^1, \ldots, X^{n^{\varepsilon/2}}$ have the same prefix
of length $m^{\varepsilon/2}$. Assume that two strings $X^i$ and $X^s$ have identical prefixes of length $m^{\varepsilon/2}$. Then a (k, s)-line tree comparator
module with $k \cdot s < m^{\varepsilon/2}$ delays the sorting process by $2\lceil \log(m - k \cdot s) \rceil + 1 \ge \lceil \log m \rceil$ steps. Hence, with probability
at least 1/2 the delay grows at least as $\log m \cdot \log^2 n$. This shows that if the complexity of the distributions is significantly
increased then the average delay of an odd-even merge sort circuit built from line tree comparator modules converges to
the worst-case delay.
6. Conclusion
We have presented several topologies for bit level comparators. Replacing the comparators of a word level sorting circuit
by such modules yields Boolean sorting circuits that are, on average, about as fast as the word level circuit.
The question arises whether similar results can be shown for other computational problems that can be realized on the
word as well as on the bit level. Selection looks like an interesting candidate in this respect. A first analysis indicates that
finding the minimum or maximum cannot be sped up significantly in the average case. However, we do not know whether
one can select the median of an input sequence more efficiently in the average case than in the worst case. The analysis of
this problem seems to be much more involved than in the case of minimum/maximum selection.
Arithmetic circuits for calculating arithmetic expressions also seem to be of interest. We expect that for simple addition
circuits a speed-up effect similar to the sorting case can be achieved. However, we conjecture that this is not possible if
addition and multiplication operations occur simultaneously.
References
[1] M. Ajtai, J. Komlós, E. Szemerédi, Sorting in c log n parallel steps, Combinatorica 3 (1983) 1–19.
[2] M. Al-Hajery, K. Batcher, On the bit-level complexity of bitonic sorting networks, in: Proc. 22. Int. Conf. on Parallel Processing, 1993, III.209–III.213.
[3] I. Hatirnaz, Y. Leblebici, Scalable binary sorting architecture based on rank ordering with linear area-time complexity, in: Proc. 13. IEEE ASIC/SOC
Conference, 2000, pp. 369–373.
[4] A. Jakoby, Die Komplexität von Präfixfunktionen bezüglich ihres mittleren Zeitverhaltens, Dissertation, Universität zu Lübeck, 1998.
[5] A. Jakoby, R. Reischuk, C. Schindelhauer, Circuit complexity: From the worst case to the average case, in: Proc. 26. ACM STOC, 1994, pp. 58–67.
[6] A. Jakoby, R. Reischuk, C. Schindelhauer, Malign distributions for average case circuit complexity, in: Proc. 12. STACS, in: LNCS, vol. 900, Springer,
1995, pp. 628–639.
[7] A. Jakoby, R. Reischuk, C. Schindelhauer, S. Weis, The average case complexity of the parallel prefix problem, in: Proc. 21. ICALP, in: LNCS, vol. 820,
Springer, 1994, pp. 593–604.
[8] A. Jakoby, C. Schindelhauer, Efficient addition on field programmable gate arrays, in: Proc. 21. FSTTCS, 2001, pp. 219–231.
[9] Y. Leblebici, T. Demirci, I. Hatirnaz, Full-Custom CMOS realization of a high-performance binary sorting engine with linear area-time complexity, in:
Proc. IEEE Int. Symp. on Circuits and Systems, 2003.
[10] T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[11] T. Leighton, C.G. Plaxton, A (fairly) simple circuit that (usually) sorts, in: Proc. 31. IEEE FOCS, 1990, pp. 264–274.
[12] R. Lin, S. Olariu, Efficient VLSI architecture for columnsort, IEEE Trans. on VLSI 7 (1999) 135–139.
[13] C.D. Thompson, Area-time complexity for VLSI, in: Proc. 11. ACM STOC, 1979, pp. 81–88.
[14] C.D. Thompson, The VLSI complexity of sorting, IEEE Trans. Comput. 32 (1983) 1171–1184.
