An efficient counting network  by Busch, Costas & Mavronicolas, Marios
Theoretical Computer Science 411 (2010) 3001–3030
Costas Busch a,∗, Marios Mavronicolas b,1
a Department of Computer Science, Louisiana State University, 286 Coates Hall, Baton Rouge, LA 70803, USA
b Department of Computer Science, University of Cyprus, 75 Kallipoleos Street, P.O. Box 537, CY-1678 Nicosia, Cyprus
Article info
Article history:
Received 16 October 2006
Received in revised form 1 June 2009
Accepted 12 April 2010
Communicated by P. Spirakis
Keywords:
Counting network
Balancing network
Contention
Shared memory
Distributed data structure
Abstract
We present a novel counting network construction, where the number of input wires w
is smaller than or equal to the number of output wires t. The depth of our network is
$\Theta(\lg^2 w)$, which depends only on w. In contrast, the amortized contention of the network
depends on the number of concurrent processes n and the parameters w and t. This offers
more flexibility than all previously known networks, with the same number w of input
and output wires, whose contention depends only on two parameters, w and n. In case
n > w lg w, by choosing t > w lg w the contention of our network is $O(n \lg w / w)$, which
improves by a logarithmic factor of w over all previously known networks with w wires.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
1.1. Background
A fundamental problem in distributed computing is the efficient implementation of a shared counter. In the shared
counter problem, the distributed processes access the counter through Fetch&Increment operations in order to obtain
successive integer values from a given range. Distributed problems such as load balancing and barrier synchronization can
be expressed and solved as counting problems. In a seminal work, Aspnes et al. [5] introduced counting networks as a class
of distributed data structures used to construct concurrent, low-contention implementations of distributed counters that
support the Fetch&Increment operation. A survey on counting networks can be found in [8].
Counting networks are constructed from p-input q-output asynchronous switches called (p, q)-balancers [1,5,14,15]; p is
the balancer’s input width, while q is its output width. As illustrated in Fig. 1, a balancer accepts a stream of tokens on its p
input wires (some of the notation given in the figure, such as $x_i$, $y_i$, will be introduced later in Section 2.2). The tokens arrive
asynchronously to the balancer and the balancer processes a token at a time in an atomic operation so that the i-th token to
be processed by the balancer leaves on output wire i mod q (where i = 0, 1, . . .). In the same figure, we show the number
of tokens that enter on each input wire and leave from each output wire. A balancer for which p = q will be called regular,
while a balancer for which p ≠ q will be called irregular.
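The behaviour of a single balancer described above can be illustrated with a minimal sequential sketch. The class name `Balancer` and its fields are hypothetical names chosen for this illustration, and the atomicity of each step is only modelled by running the tokens one at a time; it is not an implementation from the paper.

```python
class Balancer:
    """Sequential model of a (p, q)-balancer (illustrative helper, not from the paper)."""

    def __init__(self, p, q):
        self.p = p
        self.q = q
        self.state = 0            # output wire on which the next token will leave
        self.inputs = [0] * p     # tokens received so far on each input wire
        self.outputs = [0] * q    # tokens emitted so far on each output wire

    def traverse(self, in_wire):
        """Process one token (modelled as an atomic step) and return its output wire."""
        self.inputs[in_wire] += 1
        out_wire = self.state
        self.outputs[out_wire] += 1
        self.state = (self.state + 1) % self.q
        return out_wire


b = Balancer(4, 6)
# Eight tokens arrive on arbitrary input wires; the i-th token leaves on wire i mod 6.
exits = [b.traverse(wire) for wire in [0, 3, 3, 1, 2, 0, 1, 3]]
print(exits)      # [0, 1, 2, 3, 4, 5, 0, 1]
print(b.outputs)  # [2, 2, 1, 1, 1, 1]
```

The output wire depends only on how many tokens the balancer has processed so far, not on the input wire a token arrived on.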
A balancing network [5], denoted B, is an acyclic network of balancers, where the output wires of the balancers are
linked to input wires of others; see Fig. 1 for an illustration. The network's input wires are those input wires not linked to
any balancer's output, and similarly for the network's output wires. The number of input wires is called the network's input
width, denoted w; the number of output wires is called the network's output width, denoted t.
Fig. 1. Left: a (4, 6)-balancer; Right: a balancing network of input width 4 and output width 8.
Fig. 2. Regular balancing networks of width 4 and 8 built from (2, 2)-balancers.
A preliminary version of this work appears in the Proceedings of the 1st Merged International Parallel Processing Symposium and Symposium on Parallel
and Distributed Processing (IPPS/SPDP'98), pp. 380–384, Orlando, Florida, March/April 1998.
∗ Corresponding author. Tel.: +1 225 578 7510; fax: +1 225 578 1465.
E-mail addresses: busch@csc.lsu.edu (C. Busch), mavronic@ucy.ac.cy (M. Mavronicolas).
1 Tel.: +357 22 892702; fax: +357 22 892701.
doi:10.1016/j.tcs.2010.04.023
A balancing network in which each balancer is regular will be called a regular network. In a regular network it holds that
the input width is equal to the output width, that is, w = t, which we will simply refer to as the width of the network.
If a network uses irregular balancers, it will be called irregular. Note that in irregular networks it may be that w ≠ t. For
example, the network in Fig. 1 is irregular. Examples of regular balancing networks are shown in Fig. 2.
Tokens enter the balancing network on the input wires, typically several per wire, propagate asynchronously through
the balancers, and leave on the output wires, typically several per wire. The depth of the balancing network is the maximum
number of balancers that any token has to traverse from an input wire to an output wire. The depth of a balancing network
determines its latency, which is a delay due to the physical characteristics of a balancing network. Another significant source of
delay is token collisions (contention), which we discuss below.
A balancing network is a counting network [5] if the overall distribution of output tokens across the output wires satisfies
the step property: exiting tokens are divided uniformly among the output wires, while any excess tokens emerge on the upper
wires. The balancing networks illustrated in Figs. 1–3 are all counting networks. In Fig. 1, we show a particular distribution
of tokens on the input and output wires; note that the step property holds on the distribution of tokens across the output
wires.
The primary purpose of a counting network is to support distributed Fetch&Increment operations. Each token corresponds
to a request by a process to increment a distributed counter. Suppose that the output width of a counting network is t. At
each output wire i on the counting network there is a variable $v_i$ that assigns counter values to the tokens. The initial value
of $v_i$ is i. If a token τ exits on wire i, in an atomic operation, τ is assigned the value $v_i$, and the value of $v_i$ is increased by t. If
in total m tokens traverse the network, each token is assigned a counter value between 0 and m − 1. For example, in Fig. 1,
we show the respective counter value that is assigned to each token that exits from the counting network (on the right part
of the figure), and similarly we depict the token values that exit from the (4, 6)-balancer (on the left part of the figure).
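The value assignment at the output wires can be sketched as follows. The class `OutputCounters` is a hypothetical helper, and the order in which tokens exit (token j on wire j mod t) is an assumption standing in for a counting network whose outputs satisfy the step property; the network itself is not modelled.

```python
class OutputCounters:
    """Per-wire counters v_0, ..., v_{t-1} (illustrative helper, not from the paper)."""

    def __init__(self, t):
        self.t = t
        self.v = list(range(t))   # v_i is initialised to i

    def assign(self, wire):
        """Assign the current value of v_wire to a token, then advance v_wire by t."""
        value = self.v[wire]
        self.v[wire] += self.t
        return value


t = 8
m = 11
counters = OutputCounters(t)
# Assumed exit pattern: token j leaves on wire j mod t, which yields a step
# distribution of tokens over the output wires.
values = [counters.assign(j % t) for j in range(m)]
print(sorted(values))  # every value in 0..m-1 appears exactly once
```

Because wire i hands out exactly the values congruent to i modulo t, a step-distributed output guarantees that the m tokens receive the values 0 through m − 1 without gaps or duplicates.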
1.2. Contention
On an MIMD shared memory multiprocessor machine, the balancing network B is implemented as a shared data
structure. Each balancer is a memory location shared by all the processes; wires are pointers from one memory location to
another. The memory location of the balancer contains the necessary information that determines the wire the next token
will exit from; this information may be accessed by any process's token. Each of the machine's n asynchronous processes
runs a program that repeatedly traverses the data structure from some input pointer to some output pointer, each time
shepherding a new token through the network. Call n the concurrency.
Fig. 3. Partition of network C(w, t) into blocks $N_a$, $N_b$ and $N_c$.
Tokens generated by process $p_l$, l ∈ {0, ..., n − 1}, enter the network on input wire l mod w, where w is the input
width of the network. Since each process can have at most one token traversing the network at any time, the total number
of tokens simultaneously traversing the network is no more than n. Note that in the whole history of an execution, the total
number of tokens that will traverse a network may exceed n, since a process may issue a token several times.
Contention in balancing network B occurs when two or more tokens are trying to access the same balancer
simultaneously. In such a case, the tokens contend for which one will atomically access the memory location of the balancer;
all unsuccessful tokens must wait and try again. Each time a token passes through a balancer, it causes a stall to all other
tokens waiting at the balancer. Equivalently, every time a token is bypassed by another token, it incurs a stall.
The number of total stalls has been proposed by Dwork et al. [12] as a complexity-theoretic measure of contention in
shared memory algorithms. Roughly speaking, the contention incurred by the traversal of m tokens through the network
B at concurrency n, denoted cont(B, n,m), is the maximum number of stalls, over all possible executions, induced by an
adversary scheduler. The amortized contention of the network B at concurrency n, denoted cont(B, n), is the limit supremum
of cont(B, n, m) divided by m, as m goes to infinity:
$$\mathrm{cont}(B, n) = \limsup_{m \to \infty} \frac{\mathrm{cont}(B, n, m)}{m}.$$
Naturally, the higher the (amortized) contention, the smaller the network’s token throughput. Clearly, amortized
contention is an appropriate measure of the average delay experienced by any token traversing a network. The amortized
contention measure is both simple and practical in the sense that the only parameters that turned out to be needed for its
analysis are the concurrency n and the width of the network. Even more so, it does not require any timing information on
the arrival and departure time of tokens, as more complicated queueing theory models do.
1.3. Contribution
Almost all known counting network constructions are regular and they are built from regular balancers (see, e.g.,
[1,5,7,9,14]). The prime example of such networks is the bitonic counting network [5, Section 3]; built from (2, 2)-balancers,
it achieves input and output width $w = 2^k$, for any integer k > 0. The depth of the bitonic network is $\Theta(\lg^2 w)$, while its
amortized contention is $\Theta(n \lg^2 w / w)$ [12, Section 3.2]. Another notable construction with similar structural characteristics
is the periodic counting network [5, Section 4], which achieves amortized contention $O(n \lg^3 w / w)$ [12, Section 3.4].
In this work, we depart from the regular approach to counting networks, and we build irregular counting networks. The
principal motivation for our study is to improve the efficiency of counting networks by relaxing regularity. More specifically,
we are interested in understanding whether, and by how much, irregular networks may improve on efficiency regarding
amortized contention, at the same level of latency (network depth) and concurrency, over their regular counterparts.
1.3.1. Bounds
We present a novel construction of an irregular counting network C(w, t), where the input width w is smaller than or
equal to the output width t. Specifically, $w = 2^k$ and $t = p \cdot w$, for any integers k, p > 0. Our network is constructed from
(2, 2)-balancers and (2, 2p)-balancers. The depth of network C(w, t) depends solely on the input width w; it is $\Theta(\lg^2 w)$.
Note that this depth is exactly the same as the depth of a bitonic network of width w. For example, the network C(4, 8) is
illustrated in Fig. 1, the network C(8, 16) in Fig. 3, and the regular networks C(4, 4) and C(8, 8) in Fig. 2.
We discover that the amortized contention of the network C(w, t) depends on all three parameters w, t and n:
$$\mathrm{cont}(C(w, t), n) = O\left(\frac{n \lg w}{w} + \frac{n \lg^2 w}{t} + \frac{w \lg^3 w}{t} + \lg^2 w\right).$$
Since the amortized contention is now determined by three parameters, we expect this to offer some more flexibility and
trade-offs when one must choose the right network for the specific needs of any particular counting problem. Apparently,
our network provides more options than previous (regular) networks, like the bitonic and the periodic, whose contention
depends only on two parameters (w and n).
To demonstrate this additional flexibility, but also to compare our network with previous constructions, we adjust the
output width t for achieving efficiency.
• When t = w, we obtain a regular counting network family C(w, w), with depth $\Theta(\lg^2 w)$ and amortized contention
$O(n \lg^2 w / w + \lg^3 w)$. For n ≥ w lg w, the amortized contention becomes $O(n \lg^2 w / w)$, which is the same as in the
bitonic network of width w. We would like to note that this is a new regular network family which is different from the
bitonic and periodic counting networks.
• By increasing the output width t the contention of network C(w, t) decreases, while its depth remains the same (since
it only depends on w).
– Specifically, by taking t = w lg w the amortized contention becomes $O(n \lg w / w + \lg^2 w)$. For n ≥ w lg w
the amortized contention becomes $O(n \lg w / w)$, which is better by a logarithmic factor of w over the amortized
contention of the bitonic counting network with the same depth and width w. Therefore, our network can handle
the higher concurrency better.
Since our network achieves a decreased amortized contention as a function of the concurrency n, we naturally expect that
it offers the option of a higher throughput for the same latency. No such options were available for any of the previously
known networks.
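To make the effect of the output width concrete, the expression inside the O(·) of the contention bound can be evaluated numerically. This is only a sketch: the function names are hypothetical, the constant factors hidden by the asymptotic notation are ignored, and the chosen values of w and n are arbitrary, so the numbers are useful for comparing choices of t rather than as predicted stall counts.

```python
from math import log2


def contention_bound_terms(n, w, t):
    """Return the four terms inside the O(.) of the amortized contention bound.

    Hidden constant factors are ignored (an assumption of this sketch).
    """
    lg = log2(w)
    return (n * lg / w, n * lg ** 2 / t, w * lg ** 3 / t, lg ** 2)


def contention_bound(n, w, t):
    """Sum of the four terms for concurrency n, input width w, output width t."""
    return sum(contention_bound_terms(n, w, t))


w = 64                      # input width, lg w = 6
n = 4 * w * 6               # a concurrency above w lg w = 384
regular = contention_bound(n, w, w)       # regular case, t = w
wide = contention_bound(n, w, w * 6)      # irregular case, t = w lg w
print(regular, wide)
```

With these values the t = w lg w expression is smaller than the t = w expression by a factor of a few, in line with the logarithmic gain stated in the bullets.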
An experimental analysis that compares our counting network with the bitonic and periodic networks has been conducted
in [19,20]. These experimental results validate our theoretical analysis, and demonstrate that under high concurrency,
when choosing a larger output width than the input width, our network outperforms the bitonic and periodic networks
of the same input width. The experiments were performed both with simulations and a real system implementation using 10
Sun UltraSparc-10 workstations.
1.3.2. Structural interpretation
The construction of the counting network C(w, t) is recursive. We first (recursively) construct two counting networks
C(w/2, t/2). Then, we combine their outputs with a merging network M(t, δ), which is a regular balancing network of
width t and produces the final output of the counting network C(w, t) (see Fig. 10). The parameter δ denotes the maximum
difference in the number of tokens that exit the two counting networks C(w/2, t/2). A characteristic of our construction
is that we use a layer of balancers before the networks C(w/2, t/2) in order to bound δ ≤ w/2. The merging network M(t, δ)
takes advantage of this bound on δ in order to achieve depth lg δ = O(lg w). Using this property recursively, the total depth
of C(w, t) is $\Theta(\lg^2 w)$, which depends only on the input width w.
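The depth claim can be sketched as a recurrence. Here D(w) is a symbol introduced only for this illustration to denote the depth of C(w, t); the sketch assumes that the layer placed before the two sub-networks has depth 1 and that the base network C(2, 2p) has depth 1, and it recovers only the upper bound, not the matching lower bound.

```latex
\begin{align*}
D(2) &= 1, \\
D(w) &= 1 + D(w/2) + \operatorname{depth}\bigl(M(t, \delta)\bigr),
        \qquad \operatorname{depth}\bigl(M(t, \delta)\bigr) = \lg \delta = O(\lg w), \\
D(w) &= 1 + \sum_{j=2}^{\lg w} \bigl(1 + O(j)\bigr) = O(\lg^2 w).
\end{align*}
```

The recursion level with input width $2^j$ contributes one layer plus a merging network of depth O(j), and no term depends on the output width t.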
The use of a recursive construction with merging networks is a fundamental technique in building counting networks
[5,9,14] and similar data structures [6,23] (sorting networks). The best example is the bitonic counting network [5]. The key
difference of our network from the other known recursive constructions, is that the depth of our merging network depends
only on the maximum difference δ of the outputs of the counting networks to be merged. In all the other constructions, the
depth of the merging network depends on the output widths of the counting networks to be merged. For example, if in our
counting network construction we used the bitonic merging network instead of ours, in order to merge the outputs of the
two counting networks C(w/2, t/2) of output width t/2, we would require a merging network of depth lg t. This would
result in a total counting network depth $\Theta(\lg^2 t)$, which depends on the output width t. In contrast, using our merging
network, the total depth depends only on the input width w. We would like to note here again that having w different from
t gives the flexibility to design a counting network where w and t can be chosen appropriately to achieve smaller contention.
We also want to note that the resulting network of our construction is different from the bitonic network when
w = t.
We attribute the improved contention performance enjoyed by our network to some features of its unique structure.
When we unfold the recursive construction of network C(w, t) and we look inside its structure, we identify three series of
network blocks, $N_a$, $N_b$, and $N_c$, as illustrated in Fig. 3:
• Block $N_a$ has input and output width w and depth lg w − 1; it is built from (2, 2)-balancers. This block is regular.
It corresponds to the balancers placed before the recursively built counting networks in order to bound their output
difference.
• Block $N_b$ has input width w, output width t, and depth 1; it is built from (2, 2p)-balancers. This block serves as a transition
block from block $N_a$ to block $N_c$. This block is irregular. It corresponds to the balancers at the basis of the recursion
(network C(2, 2p)).
• Block $N_c$ has input and output width t, and depth $\Theta(\lg^2 w)$; it is built from (2, 2)-balancers. This block is regular. It
corresponds to all the merging networks used in the recursion.
The dominant block with respect to depth is block $N_c$. Thus, intuitively, tokens spend most of their time in this block of
the network. It is therefore expected that contention will be heavily influenced by the parameters of this block. By increasing
the output width t, block $N_c$ becomes wider and fatter with respect to the number of balancers, so that tokens there will
then have less chance to collide at the same balancer. Consequently, as t increases, the contention in block $N_c$ decreases,
so that the contention of the entire network decreases. Even more so, by unboundedly increasing t, the contention in block
$N_c$ approaches $\lg^2 w$ (independent of the concurrency n). However, for a fixed w, as t becomes large, block $N_a$ remains the
same; thus, block $N_a$ will be the one to determine the network's contention when w ≪ t. Nevertheless, since the depth
of block $N_a$ is only Θ(lg w), block $N_a$ cannot affect the performance of the entire network very much, so that the gain in
contention due to increasing t is preserved.
We remark that increasing t causes a corresponding increase in the number of balancers in block $N_c$. For really large
values of t, this may cause a resource burden when implementing the network in a real shared memory multiprocessor
system. Thus, there is an implementation tradeoff between the two cases w = t and w ≪ t. This tradeoff will have
to depend on the particular intricacies and requirements of the counting problem at hand. A compromise where t = w lg w
seems to provide a reasonable solution.
1.4. Related work
1.4.1. Comparison to other irregular networks
There are only two other known irregular counting networks. The first one, called a diffracting tree, is given by Shavit
and Zemach [26]; built from (1, 2)-balancers, it has the form of a binary tree with 1 input wire, w output wires, and depth
lgw. This construction employs randomization to implement a diffraction scheme, which allows a pair of colliding tokens
to combine and eliminate themselves. Experiments have revealed some nice performance results for this construction;
similar results have also been obtained in [25] for the steady state, and under certain probabilistic assumptions on the frequency
of traversals. Nevertheless, the amortized contention of the diffracting tree is Θ(n), since it is possible for an adversary
scheduler to accumulate all tokens at the root of the tree.
The second irregular construction is given by Aiello et al. [3]; it has input width w, output width w lg w, and depth
O(lg w); it is built from (2, 2)-balancers and (1, 2)-balancers. This construction uses as a building block the AKS sorting
network [4], whose depth is O(lg w), but the asymptotic depth notation hides huge constant factors, which makes this
counting network construction of no practical interest (see footnote 2). On the other hand, our construction has small constants in
the asymptotic notation of its depth and can be easily implemented in practice.
1.4.2. Other related work
Aharonson and Attiya [1] consider irregular balancers and balancing networks in terms of impossibility results of their
feasible constructions. In particular, they prove that a counting network with output width w cannot be constructed by
balancers of output width $b_1, \ldots, b_k$, if there exists a prime factor p of w such that p does not divide $b_i$ for all i, 1 ≤ i ≤ k (this
result actually holds for smoothing networks, which is a more general class than counting networks). Similar impossibility
results for regular counting networks have been also studied in [10]. Irregular balancing networks have also been considered
in [2], where it is shown that counting networks can immediately support Fetch&Decrement operations together with
Fetch&Increment operations. The possibility of using counting networks to support arbitrary read-modify-write operations
has been studied in [11,13].
In [16] the authors consider linearizable counting networks, where tokens get values that reflect the time order that they
access the counting network, that is, new tokens get higher values than tokens that have already traversed the network. The
authors considered wait-free implementations where tokens failing in one location (balancer) do not prohibit the progress
of tokens in other locations of the network. In that paper the authors have studied the effects of linearizability with respect
to contention, latency (depth) and wait-freedom. They showed an interesting lower bound that says that low contention
and wait-free linearizable counting networks must have at least Ω(n) latency. Thus, linearizability comes at a high cost
in network depth. In the networks that we consider here we do not have the same tradeoffs since we do not consider
linearizability. Note however, that our networks and all the known non-linearizable counting network constructions are
wait-free.
1.5. Road map
In Section 2, we offer some necessary preliminaries for integer sequences, and give some definitions and basic results
for balancing networks. We give in Section 3 the construction of a merging network, which is a major building block for the
construction of our counting network in Section 4. In Section 5, we give the butterfly network, which is a special network
that will be useful for the analysis of the amortized contention of our counting network in Section 6. Finally, in Section 7 we
discuss our results and give some open problems.
2. Preliminaries
2.1. Sequences
The number of tokens that enter or exit a balancing network will be represented with sequences. We denote an integer
sequence of length w with a boldface letter such as $\mathbf{x}^{(w)}$. The elements of the sequence are denoted with small letters; so,
$\mathbf{x}^{(w)} = x_0, x_1, \ldots, x_{w-1}$. The sum of the elements of the sequence is denoted as $\sum(\mathbf{x}^{(w)}) = x_0 + x_1 + \cdots + x_{w-1}$. The
maximum value is $\max_i(x_i)$, while the minimum value is $\min_i(x_i)$.
A subsequence of $\mathbf{x}^{(w)}$ is any sequence of elements $x_{i_0}, x_{i_1}, \ldots, x_{i_{k-1}}$, such that $i_j < i_{j+1}$, for all 0 ≤ j < k − 1. The even
subsequence of $\mathbf{x}^{(w)}$ is $\mathbf{x}_e^{(w/2)} = x_0, x_2, \ldots$, while the odd subsequence is $\mathbf{x}_o^{(w/2)} = x_1, x_3, \ldots$. If w is even, the first half of $\mathbf{x}^{(w)}$
is the subsequence $x_0, x_1, \ldots, x_{w/2-1}$, while the second half of $\mathbf{x}^{(w)}$ is the subsequence $x_{w/2}, x_{w/2+1}, \ldots, x_{w-1}$.
A sequence $\mathbf{x}^{(w)}$ has the k-smooth property [1,5] if $|x_i - x_j| \le k$ for any pair of indices i and j such that 0 ≤ i, j < w;
we say also that the sequence $\mathbf{x}^{(w)}$ is k-smooth. Notice that the elements of any k-smooth sequence take values in a range
a, a + 1, ..., a + k, for some integer a.
A particular kind of 1-smooth sequences are the step sequences. A sequence $\mathbf{x}^{(w)}$ has the step property [5] if $0 \le x_i - x_j \le 1$
for any pair of indices i and j such that 0 ≤ i < j < w; alternatively, we say that the sequence $\mathbf{x}^{(w)}$ is step. For a step sequence
$\mathbf{x}^{(w)}$, its step point is either the unique index i such that $x_i < x_{i-1}$, or w if all $x_i$ are equal; that is, 1 ≤ i ≤ w. For any element
$x_i$ of a step sequence $\mathbf{x}^{(w)}$, it holds that [5]:
$$x_i = \left\lceil \frac{\sum(\mathbf{x}^{(w)}) - i}{w} \right\rceil. \tag{1}$$
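The definitions above translate directly into small checks. The following sketch uses hypothetical helper names (`is_k_smooth`, `is_step`, `step_point`, `step_sequence`); `step_sequence` builds the step sequence of a given sum from Eq. (1), using integer arithmetic for the ceiling.

```python
def is_k_smooth(x, k):
    """k-smooth property: any two elements differ by at most k."""
    return max(x) - min(x) <= k


def is_step(x):
    """Step property: 0 <= x[i] - x[j] <= 1 for all i < j."""
    return all(0 <= x[i] - x[j] <= 1
               for i in range(len(x)) for j in range(i + 1, len(x)))


def step_point(x):
    """Unique index i with x[i] < x[i-1], or len(x) if all elements are equal."""
    for i in range(1, len(x)):
        if x[i] < x[i - 1]:
            return i
    return len(x)


def step_sequence(total, w):
    """Eq. (1): x_i = ceil((total - i) / w), computed as -floor((i - total) / w)."""
    return [-((i - total) // w) for i in range(w)]


x = step_sequence(11, 4)
print(x, is_step(x), is_k_smooth(x, 1), step_point(x))
# [3, 3, 3, 2] True True 3
```

For a fixed width, the step sequence is determined by its sum alone, which is the fact the later lemmas rely on.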
We continue with two basic results for step sequences. (The following result has also been proven in [5, Lemma 3.1].)
Lemma 2.1. Any subsequence of a step sequence is step.
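Proof. Suppose that $\mathbf{x}^{(w)}$ is a step sequence. Let $\hat{\mathbf{x}}^{(m)} = x_{i_0}, x_{i_1}, \ldots, x_{i_{m-1}}$ be a subsequence of $\mathbf{x}^{(w)}$. We have that
$0 \le x_{i_j} - x_{i_k} \le 1$, for all 0 ≤ j < k < m, since $i_j < i_k$ and $\mathbf{x}^{(w)}$ is step. Therefore, $\hat{\mathbf{x}}^{(m)}$ is step too. □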
Footnote 2: In [21,22] it is shown that there is a regular counting network of depth O(lg w) which is constructed using the AKS sorting network.
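Lemma 2.2. Consider a pair of step sequences $\mathbf{x}^{(w)}$ and $\mathbf{y}^{(w)}$, where w ≥ 2, with maximum values a and b, respectively. If there
is an integer δ such that
$$0 \le \sum(\mathbf{x}^{(w)}) - \sum(\mathbf{y}^{(w)}) \le \delta,$$
then
$$0 \le a - b \le \left\lfloor \frac{\delta}{w} \right\rfloor + 1.$$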
Proof. Since a and b represent the maximum values of the step sequences $\mathbf{x}^{(w)}$ and $\mathbf{y}^{(w)}$, respectively, it holds that
$$w(a - 1) < \sum(\mathbf{x}^{(w)}) \le wa,$$
and
$$w(b - 1) < \sum(\mathbf{y}^{(w)}) \le wb.$$
By subtracting these two inequalities, we get
$$w(a - b - 1) < \sum(\mathbf{x}^{(w)}) - \sum(\mathbf{y}^{(w)}) < w(a - b + 1).$$
By inequality
$$0 \le \sum(\mathbf{x}^{(w)}) - \sum(\mathbf{y}^{(w)}) \le \delta,$$
we get that
$$w(a - b - 1) < \delta,$$
and
$$w(a - b + 1) > 0.$$
Since w ≥ 2, it follows that
$$-1 < a - b < \frac{\delta}{w} + 1.$$
Since a and b are integers, this implies that
$$0 \le a - b \le \left\lfloor \frac{\delta}{w} \right\rfloor + 1,$$
as needed. □
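Lemma 2.2 can be sanity-checked by brute force over small parameters. The sketch below is an illustration, not part of the proof: the function names and the search limits are arbitrary choices, and step sequences are generated from their sum via Eq. (1). Only the smallest admissible δ (the actual difference of the sums) is tested, since a larger δ only weakens the bound.

```python
def step_sequence(total, w):
    """Step sequence of length w with the given sum, from Eq. (1)."""
    return [-((i - total) // w) for i in range(w)]


def check_lemma_2_2(max_w=6, max_total=30):
    """Exhaustively test 0 <= a - b <= floor(delta / w) + 1 on small cases."""
    for w in range(2, max_w + 1):
        for sum_y in range(max_total + 1):
            for sum_x in range(sum_y, max_total + 1):
                delta = sum_x - sum_y              # smallest admissible delta
                a = max(step_sequence(sum_x, w))   # maximum value of x
                b = max(step_sequence(sum_y, w))   # maximum value of y
                if not (0 <= a - b <= delta // w + 1):
                    return False
    return True


print(check_lemma_2_2())  # True
```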
Next, we give properties for the even and odd subsequences of step sequences. (A similar result has also been proven in
[5, Lemma 3.2].)
Lemma 2.3. If $\mathbf{x}^{(w)}$ is a step sequence, and w is even with w ≥ 2, then
$$0 \le \sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{x}_o^{(w/2)}) \le 1.$$
Proof. Let a be the maximum value of $\mathbf{x}^{(w)}$. Let k be the step point of $\mathbf{x}^{(w)}$. Thus, all elements $x_0, \ldots, x_{k-1}$ have value a,
while all elements $x_k, \ldots, x_{w-1}$ have value a − 1.
If k is even, we have $x_{2i} = x_{2i+1} = a$, for i < k/2, while $x_{2i} = x_{2i+1} = a - 1$, for i ≥ k/2. Thus, $\mathbf{x}_e^{(w/2)} = \mathbf{x}_o^{(w/2)}$, which implies
$$\sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{x}_o^{(w/2)}) = 0.$$
If k is odd, we have $x_{2i} = x_{2i+1} = a$, for i < (k − 1)/2, while $x_{2i} = x_{2i+1} = a - 1$, for i > (k − 1)/2. Further,
$a - 1 = x_k < x_{k-1} = a$. Thus, $\mathbf{x}_e^{(w/2)}$ and $\mathbf{x}_o^{(w/2)}$ differ only in their element of index (k − 1)/2, where the even
subsequence holds $x_{k-1} = a$ and the odd subsequence holds $x_k = a - 1$, which implies
$$\sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{x}_o^{(w/2)}) = 1.$$ □
Lemma 2.4. Consider two step sequences $\mathbf{x}^{(w)}$ and $\mathbf{y}^{(w)}$, where w is even and w ≥ 2. If there is an even integer δ such that
$$0 \le \sum(\mathbf{x}^{(w)}) - \sum(\mathbf{y}^{(w)}) \le \delta,$$
then
$$0 \le \sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{y}_e^{(w/2)}) \le \frac{\delta}{2},$$
and
$$0 \le \sum(\mathbf{x}_o^{(w/2)}) - \sum(\mathbf{y}_o^{(w/2)}) \le \frac{\delta}{2}.$$
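Proof. Denote
$$A = \sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{y}_e^{(w/2)}), \qquad B = \sum(\mathbf{x}_o^{(w/2)}) - \sum(\mathbf{y}_o^{(w/2)}).$$
We will show that 0 ≤ A ≤ δ/2 and 0 ≤ B ≤ δ/2. We have:
$$\sum(\mathbf{x}^{(w)}) = \sum(\mathbf{x}_e^{(w/2)}) + \sum(\mathbf{x}_o^{(w/2)}), \qquad \sum(\mathbf{y}^{(w)}) = \sum(\mathbf{y}_e^{(w/2)}) + \sum(\mathbf{y}_o^{(w/2)}).$$
Since by assumption, $0 \le \sum(\mathbf{x}^{(w)}) - \sum(\mathbf{y}^{(w)}) \le \delta$, it follows that
$$0 \le \left(\sum(\mathbf{x}_e^{(w/2)}) + \sum(\mathbf{x}_o^{(w/2)})\right) - \left(\sum(\mathbf{y}_e^{(w/2)}) + \sum(\mathbf{y}_o^{(w/2)})\right) \le \delta$$
or
$$0 \le A + B \le \delta. \tag{2}$$
From Lemma 2.3,
$$0 \le \sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{x}_o^{(w/2)}) \le 1, \tag{3}$$
and
$$0 \le \sum(\mathbf{y}_e^{(w/2)}) - \sum(\mathbf{y}_o^{(w/2)}) \le 1. \tag{4}$$
By subtracting Inequalities (3) and (4) we get
$$-1 \le \left(\sum(\mathbf{x}_e^{(w/2)}) - \sum(\mathbf{y}_e^{(w/2)})\right) - \left(\sum(\mathbf{x}_o^{(w/2)}) - \sum(\mathbf{y}_o^{(w/2)})\right) \le 1$$
or
$$-1 \le A - B \le 1. \tag{5}$$
By adding Inequalities (2) and (5), we get
$$-\frac{1}{2} \le A \le \frac{\delta}{2} + \frac{1}{2},$$
and by subtracting Inequalities (2) and (5), we get
$$-\frac{1}{2} \le B \le \frac{\delta}{2} + \frac{1}{2}.$$
Since A and B are integers and δ is even, we get
$$0 \le A \le \frac{\delta}{2},$$
and
$$0 \le B \le \frac{\delta}{2},$$
as needed. □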
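Lemmas 2.3 and 2.4 can likewise be checked exhaustively on small instances. This is an illustrative sketch with hypothetical function names and arbitrary search limits; step sequences are generated from their sum via Eq. (1), and for each pair of sums the smallest even δ that satisfies the hypothesis is used.

```python
def step_sequence(total, w):
    """Step sequence of length w with the given sum, from Eq. (1)."""
    return [-((i - total) // w) for i in range(w)]


def check_lemmas_2_3_and_2_4(max_w=8, max_total=40):
    """Brute-force test of the even/odd subsequence bounds for even widths."""
    for w in range(2, max_w + 1, 2):
        for sum_y in range(max_total + 1):
            y = step_sequence(sum_y, w)
            # Lemma 2.3: the even and odd subsequence sums differ by 0 or 1.
            if not 0 <= sum(y[0::2]) - sum(y[1::2]) <= 1:
                return False
            for sum_x in range(sum_y, max_total + 1):
                x = step_sequence(sum_x, w)
                delta = sum_x - sum_y
                if delta % 2 == 1:
                    delta += 1                       # smallest even delta
                A = sum(x[0::2]) - sum(y[0::2])      # difference of even parts
                B = sum(x[1::2]) - sum(y[1::2])      # difference of odd parts
                # Lemma 2.4: both differences lie in [0, delta / 2].
                if not (0 <= A <= delta // 2 and 0 <= B <= delta // 2):
                    return False
    return True


print(check_lemmas_2_3_and_2_4())  # True
```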
2.2. Balancing networks
Consider a balancing network B of input width w and output width t. Each balancer b in B has a depth, which is the
length of the longest path, in terms of number of balancers, that a token has to traverse from an input wire of B until the
token reaches an output wire of balancer b. The depth of the network, denoted depth(B), is the maximum depth of any
balancer in B. Suppose that d = depth(B). Network B can be decomposed into d layers of balancers, $\ell_1, \ldots, \ell_d$, such that
layer $\ell_i$ contains all the balancers of depth i. Note that a layer is itself a balancing network of depth 1. Note also that in a
regular balancing network all the layers have input and output width equal to the width of the regular network.
Consider now a (p, q)-balancer b. At any moment during an execution where tokens traverse b, the balancer b has a state
which is the index of the output wire on which b will forward the next token that it processes. Thus, the state is a number
in {0, . . . , q− 1}. A transition α(τ , b) is the action of taking a token τ from the input and forwarding it to an output wire of
b. The transition increases the state of the balancer by one (in a modulo-q operation).
Consider now a balancing network B with balancers $b_1, \ldots, b_k$. The state of the network is the collection of the states
of its balancers. Each transition brings the network from one state to another. An execution E in B that involves tokens
$\tau_1, \ldots, \tau_m$ can be represented as a sequence of transitions $E = \alpha(\tau_{i_1}, b_{i_1}), \alpha(\tau_{i_2}, b_{i_2}), \ldots, \alpha(\tau_{i_k}, b_{i_k})$, where k denotes the
length of the execution (number of transitions). For parallel transitions, where one transition does not cause the other, the
relative order is not important in the sequence of transitions in E. However, for causal transitions (non-parallel transitions),
where one transition causes the other, the execution order is preserved in the sequence of transitions in E. At the end of
the last transition in the execution (transition $\alpha(\tau_{i_k}, b_{i_k})$), there are no tokens traversing the network. In this case we say
that the network has reached a quiescent state. Note that we could have infinite executions; however, the behavior of the
balancing network is determined in quiescent states, at the end of finite executions.
Consider a (p, q)-balancer b in a quiescent state. Let $x_i$ denote the number of tokens that have entered the balancer on
input wire i. The sequence $\mathbf{x}^{(p)} = x_0, \ldots, x_{p-1}$ is the input sequence to balancer b (see Fig. 1). Let $y_i$ denote the number of
tokens that have left from output wire i of balancer b. The sequence $\mathbf{y}^{(q)} = y_0, \ldots, y_{q-1}$ is the output sequence of balancer b.
The output sequence $\mathbf{y}^{(q)}$ satisfies the step property. The input and output sequences satisfy the sum preservation property,
$\sum(\mathbf{x}^{(p)}) = \sum(\mathbf{y}^{(q)})$, which expresses the fact that in a quiescent state all tokens that have entered the balancer have also
left it.
Since $\mathbf{y}^{(q)}$ is step, from Eq. (1) it holds for any output wire i that
$$y_i = \left\lceil \frac{\sum(\mathbf{y}^{(q)}) - i}{q} \right\rceil.$$
From the sum preservation property, we have
$$y_i = \left\lceil \frac{\sum(\mathbf{x}^{(p)}) - i}{q} \right\rceil.$$
Therefore, the output sequence $\mathbf{y}^{(q)}$ is a function of the number of tokens $\sum(\mathbf{x}^{(p)})$ that have entered b. Thus, any two executions that
involve balancer b such that the same number of tokens traverse the balancer starting from the same balancer state, will
leave the balancer in the same state with the same values on the output sequence $\mathbf{y}^{(q)}$.
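The closed form for the balancer's output sequence can be compared against a direct simulation. The sketch below uses hypothetical function names and assumes the balancer starts in state 0; it counts only how many tokens pass, since the input wire does not affect the output.

```python
def simulate_balancer(q, m):
    """Send m tokens through a balancer with q output wires, starting in state 0."""
    y = [0] * q
    state = 0
    for _ in range(m):
        y[state] += 1
        state = (state + 1) % q
    return y


def closed_form(q, m):
    """y_i = ceil((m - i) / q), computed with integer floor division."""
    return [-((i - m) // q) for i in range(q)]


print(simulate_balancer(6, 8), closed_form(6, 8))
# [2, 2, 1, 1, 1, 1] [2, 2, 1, 1, 1, 1]
```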
Consider now a balancing network B of input width w and output width t. For any quiescent state, we define the input
sequence $\mathbf{x}^{(w)}$ and output sequence $\mathbf{y}^{(t)}$ of B, similarly as for the balancer (see Fig. 1). It can easily be shown by induction on
the layers that B satisfies the sum preservation property: $\sum(\mathbf{x}^{(w)}) = \sum(\mathbf{y}^{(t)})$. Similarly, it can be shown by induction on
the number of layers of B that the output sequence $\mathbf{y}^{(t)}$ depends only on the particular values in each entry of $\mathbf{x}^{(w)}$ (this is
in contrast to a single balancer whose output depends only on the sum of the input values). Thus, if in any two executions the
number of tokens on each input wire is the same, then the corresponding values in $\mathbf{y}^{(t)}$ are the same.
We want to note that some of the notation will be abused in the paper. For example, as we saw above, x is used to
represent both input sequences of balancers and balancing networks. As we will see below, y will also be used to represent
input sequences of balancing networks. Further, sometimes sequences will be used to denote network wires. Every time it
will be clear from the context what purpose a sequence serves.
Wewill consider the following balancing network families, which are described according to their behavior in a quiescent
state:
• Counting network: For any values in the input sequence, the output sequence satisfies the step property.
• k-Smoothing network: For any values in the input sequence, the output sequence satisfies the k-smooth property.
• Difference merging network: Suppose that the input width of the network is w. Let u(w) be the input sequence. The input sequence u(w) is decomposed into two sequences x(w/2) (first input sequence) and y(w/2) (second input sequence), consisting of the first and second half, respectively, of u(w). There is a merging parameter δ ≥ 1 that specifies the behavior of the network. If both x(w/2) and y(w/2) satisfy the step property, and 0 ≤ ∑(x(w/2)) − ∑(y(w/2)) ≤ δ, then the output sequence satisfies the step property too. That is, a difference merging network merges two step input sequences into a unique step output sequence, if the sums of the input sequences differ by at most δ.
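The three properties used in these definitions can be expressed as small predicates on quiescent sequences. This is a sketch under stated assumptions: the step test uses the closed form derived from Eq. (1), the k-smooth test assumes the usual meaning that any two entries differ by at most k, and all function names are hypothetical.

```python
def is_step(seq):
    """True if seq equals the unique step sequence with the same sum."""
    total, n = sum(seq), len(seq)
    return all(v == (total - i + n - 1) // n for i, v in enumerate(seq))


def is_k_smooth(seq, k):
    """True if any two entries of seq differ by at most k."""
    return max(seq) - min(seq) <= k


def merging_precondition(u, delta):
    """Check the hypothesis of a difference merging network on input u."""
    half = len(u) // 2
    x, y = u[:half], u[half:]  # first and second input sequences
    return is_step(x) and is_step(y) and 0 <= sum(x) - sum(y) <= delta


if __name__ == "__main__":
    print(is_step([2, 2, 1]), is_step([1, 2, 2]))
    print(is_k_smooth([3, 1, 2], 2))
    print(merging_precondition([2, 1, 1, 0], 2))
```

A counting network must make `is_step` hold on its output for every input, whereas a difference merging network only has to do so when `merging_precondition` holds.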
The following result states that in a regular balancing network, if the input to a layer is k-smooth, then the output of the
layer is k-smooth too, which implies that every subsequent layer is k-smooth too.
Lemma 2.5. Consider a regular balancing network B of width w. Let ℓ denote a layer of B. If the input sequence to ℓ is k-smooth, then the output sequence of ℓ is k-smooth too.
Proof. Suppose that the input sequence to ℓ is k-smooth. Thus, the values on the input sequence are in the range [a, a + k], for some a ≥ 0. The sum-preservation property of balancers implies that for any regular balancer the maximum (minimum) value of its output sequence never exceeds (falls below) the maximum (minimum) value of its input sequence. Therefore, the output sequence of each balancer in ℓ will take values in the range [a, a + k], which implies that the output sequence of ℓ is k-smooth. □
Fig. 4. Two isomorphic balancing networks.
2.3. Isomorphic balancing networks
We discuss the isomorphism of balancing networks which will be useful in the contention analysis of our counting
network.
First, we consider permutations. Consider the set H = {0, . . . , w − 1}. A permutation on H is a correspondence (one-to-one and onto function) π : H → H, that maps each element of H to another element of H. We define the permutation of a sequence x(w) to be π(x(w)) = y(w) such that x_i = y_{π(i)}, for each i ∈ H. Since permutation π is a correspondence, it has an inverse permutation, denoted π^R, such that π^R(π(i)) = i. Note that if π(x(w)) = y(w), then x(w) = π^R(y(w)).
The next result establishes that the permutation of a smooth sequence is also smooth.
Lemma 2.6. If a sequence x(w) is k-smooth and π is a permutation, then π(x(w)) is k-smooth.
Proof. Let π(x(w)) = y(w). We have that for any pair of elements y_i and y_j of y(w), x_{π^R(i)} = y_i and x_{π^R(j)} = y_j. Since |x_{π^R(i)} − x_{π^R(j)}| ≤ k, we get |y_i − y_j| ≤ k. □
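The convention x_i = y_{π(i)} is easy to get backwards, so a short sketch may help. It assumes a permutation is stored as a Python list `pi` with `pi[i]` equal to π(i); the names `permute` and `inverse` are hypothetical helpers.

```python
def permute(pi, x):
    """Return y = pi(x), defined by x[i] == y[pi[i]] for every i."""
    y = [None] * len(x)
    for i, value in enumerate(x):
        y[pi[i]] = value  # the entry at position i moves to position pi(i)
    return y


def inverse(pi):
    """Return the inverse permutation pi^R, with pi^R(pi(i)) == i."""
    inv = [None] * len(pi)
    for i, image in enumerate(pi):
        inv[image] = i
    return inv


if __name__ == "__main__":
    pi = [2, 0, 1]
    x = [5, 7, 6]
    y = permute(pi, x)
    print(y, permute(inverse(pi), y))
```

Because a permutation only rearranges entries, the largest and smallest values are unchanged, which is the content of Lemma 2.6.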
Consider two balancing networks B and B′ which both have input width w and output width t. We say that the networks B and B′ are isomorphic if two conditions hold (see also Fig. 4):
i. There is a correspondence between the balancers of B and B′, where any (p, q)-balancer b in B has a corresponding (p, q)-balancer b′ in B′.
ii. For any balancer b_i in B whose k-th output wire is connected to an input wire of a balancer b_j, it holds that in B′ the k-th output wire of balancer b′_i is connected to some input wire of balancer b′_j (the input wire in b_j is not necessarily the same as the input wire in b′_j).
We would like to note that this definition of isomorphism is different from graph isomorphism, since in balancing networks the outgoing wires have an order which would be lost if we represented each balancer as a node in a graph and the wires as adjacent edges.
Conditions i and ii imply that there is a correspondence between the input wires of B and B′. Similarly there is a correspondence between the output wires of B and B′. Let x(w) and y(t) be the respective input and output sequence of B, and let u(w) and z(t) be the respective input and output sequence of B′. Let π_in be the correspondence (permutation) that maps input wires of B to input wires of B′. Similarly, let π_out be the correspondence (permutation) that maps output wires of B to output wires of B′. If token τ enters on input wire j in B, then τ enters on wire π_in(j) in B′. Similarly, if token τ exits on output wire j in B, then τ exits on wire π_out(j) in B′. Therefore, any execution E = α(τ_{i_1}, b_{i_1}), . . . , α(τ_{i_k}, b_{i_k}) in B with tokens {τ_1, . . . , τ_m}, has a corresponding execution E′ = α(τ′_{i_1}, b′_{i_1}), . . . , α(τ′_{i_k}, b′_{i_k}) in B′ with tokens {τ′_1, . . . , τ′_m}, where token τ_i corresponds to token τ′_i. At the end of executions E and E′, the number of tokens that have left the ith wire of a balancer b in B is the same as the number of tokens that have left the ith wire of balancer b′ in B′. Therefore, we obtain the following lemma (note here that we abuse notation and we treat sequences as wires, and vice-versa):
Lemma 2.7. If B and B′ are isomorphic and in quiescent states with u(w) = π_in(x(w)), then z(t) = π_out(y(t)).
Next, we establish that isomorphic networks have the same smoothness properties.
Lemma 2.8. If B and B′ are isomorphic and B is k-smoothing, then B′ is k-smoothing.
Proof. Suppose that sequence u(w) takes arbitrary values. Then set the values on x(w) so that x(w) = π_in^R(u(w)). Since B is k-smoothing, y(t) is k-smooth. From Lemma 2.6, π_out(y(t)) is k-smooth too. From Lemma 2.7, π_out(y(t)) = z(t). Therefore, z(t) is k-smooth. □
3. Difference merging network
We present the construction of a difference merging network M(t, δ). The network is regular with width t; δ is the merging parameter. The parameters t and δ are chosen so that t = p · 2^i, δ = 2^j, p ≥ 1, and 1 ≤ j < i (such assignments to parameters t and δ will be called valid).
Fig. 5. Top: the network M(t, 2); Bottom: the recursive construction of M(t, δ).
3.1. Construction of M(t, δ)
We will give the construction of network M(t, δ). To describe the network, we will use the input and output sequences to refer to their corresponding wires in the network. In the construction, we will say that any two arbitrary sequences c(w) and e(y) are directly-connected if c_i is connected with a wire to e_i.
Let u(t) and z(t) denote the input and output sequences, respectively, of M(t, δ). Denote by x(t/2) (first input sequence) and y(t/2) (second input sequence), the first and second half of u(t), respectively.
The construction of M(t, δ) is recursive on the parameter δ; parameter t may take any valid value (see Fig. 5).
• Recursive basis, δ = 2. The network M(t, 2), for any valid t, consists of a single layer of t/2 copies of the (2, 2)-balancer, denoted b_0, . . . , b_{t/2−1} (see top of Fig. 5). For 1 ≤ i < t/2, the first and second input wires of balancer b_i are connected to y_{i−1} and x_i, respectively, and the first and second output wires are connected to z_{2i−1} and z_{2i}, respectively. For balancer b_0, the first and second input wires are connected to x_0 and y_{t/2−1}, respectively, and the first and second output wires are connected to z_0 and z_{t−1}, respectively.
Fig. 6. Left: the network M(8, 4); Right: the network M(16, 4).
• Recursive step, δ > 2. Suppose that we have constructed the network M(t′, δ/2), for any valid t′. The network M(t, δ), for any valid t, is constructed in two sub-steps, as follows (see bottom of Fig. 5).
– Sub-step 1. Take two copies of the network M(t/2, δ/2), denoted M_0(t/2, δ/2) and M_1(t/2, δ/2). The first input sequence of M_0(t/2, δ/2) is directly-connected to x_e(t/4) (even subsequence of x(t/2)), while the second input sequence is directly-connected to y_e(t/4) (even subsequence of y(t/2)). The first input sequence of M_1(t/2, δ/2) is directly-connected to x_o(t/4) (odd subsequence of x(t/2)), while the second input sequence is directly-connected to y_o(t/4) (odd subsequence of y(t/2)). Let g(t/2) and h(t/2) denote the output sequences of networks M_0(t/2, δ/2) and M_1(t/2, δ/2), respectively.
– Sub-step 2. Take a copy of the network M(t, 2) which is given by the recursion basis. The first input sequence of M(t, 2) is directly-connected to the output sequence g(t/2) of M_0(t/2, δ/2) and the second input sequence is directly-connected to the output sequence h(t/2) of M_1(t/2, δ/2). Finally, the output sequence of M(t, 2) is directly-connected to the output sequence z(t) of M(t, δ).
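The recursive construction above can be sketched as a computation on quiescent token counts. This is an illustrative model only: it assumes a (2, 2)-balancer sends the ceiling of half its tokens to its first output wire, it takes the even and odd subsequences to be the entries at even and odd indices, and every function name is hypothetical.

```python
def balancer22(a, b):
    """Quiescent outputs (first, second) of a (2, 2)-balancer."""
    total = a + b
    return (total + 1) // 2, total // 2


def merge_basis(x, y):
    """M(t, 2): a single layer of t/2 balancers, with t = 2 * len(x)."""
    half = len(x)
    z = [0] * (2 * half)
    # b_0 reads x_0 and y_{t/2-1}, and writes z_0 and z_{t-1}
    z[0], z[-1] = balancer22(x[0], y[half - 1])
    for i in range(1, half):
        # b_i reads y_{i-1} and x_i, and writes z_{2i-1} and z_{2i}
        z[2 * i - 1], z[2 * i] = balancer22(y[i - 1], x[i])
    return z


def merge(x, y, delta):
    """M(t, delta) applied to first input x and second input y."""
    if delta == 2:
        return merge_basis(x, y)
    g = merge(x[0::2], y[0::2], delta // 2)  # M_0 on even subsequences
    h = merge(x[1::2], y[1::2], delta // 2)  # M_1 on odd subsequences
    return merge_basis(g, h)                 # final layer M(t, 2)


def step_sequence(n, total):
    """The unique step sequence of length n with the given sum."""
    return [(total - i + n - 1) // n for i in range(n)]


def is_step(seq):
    return seq == step_sequence(len(seq), sum(seq))


if __name__ == "__main__":
    print(merge([2, 1, 1, 1], [1, 0, 0, 0], 4))
```

The example call models M(8, 4) with input sums 5 and 1, which differ by exactly δ = 4, and the printed output is again a step sequence.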
As an example of the recursive construction, Fig. 6 depicts the networks M(8, 4) and M(16, 4). Next, we calculate the depth of network M(t, δ).
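Lemma 3.1. depth(M(t, δ)) = lg δ.
Proof. By construction, depth(M(t, 2)) = 1. By the recursive construction of M(t, δ),
$$\begin{aligned}
\mathrm{depth}(M(t,\delta)) &= \mathrm{depth}\left(M\left(\tfrac{t}{2},\tfrac{\delta}{2}\right)\right) + \mathrm{depth}(M(t,2)) \\
&= \mathrm{depth}\left(M\left(\tfrac{t}{2},\tfrac{\delta}{2}\right)\right) + 1 \\
&= \lg \delta.
\end{aligned}$$
□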
3.2. Correctness of M(t, δ)
We prove the correctness of M(t, δ). We start with the correctness of M(t, 2).
Lemma 3.2. M(t, 2) is a difference merging network.
Proof. Suppose that M(t, 2) is quiescent and each of the input sequences x(t/2) and y(t/2) satisfies the step property such that 0 ≤ ∑(x(t/2)) − ∑(y(t/2)) ≤ 2. We will show that the output sequence z(t) satisfies the step property.
Denote by a and b the maximum values in sequences x(t/2) and y(t/2), respectively. Denote by k and l the respective step points. Since t/2 ≥ 2, Lemma 2.2 implies that 0 ≤ a − b ≤ 1.
Fig. 7. The case a = b and k < t/2 (panels a, b, c).
The various possible values on the input and output sequences of M(t, 2) are illustrated in Figs. 7–9. Each sequence is represented with a linear array where an entry is darker for a higher value, and lighter for a lower value. In the figures, û(t) denotes a permutation of u(t) where x(t/2) (the first half of u(t)) corresponds to the even subsequence of û(t), while y(t/2) (the second half of u(t)) corresponds to the odd subsequence of û(t).
We consider the cases a = b and a = b + 1, separately.
Fig. 8. The case a = b and k = t/2 (panels a, b, c).
• a = b. In this case, û (t) is 1-smooth. Since 0 ≤ ∑(x(t/2)) −∑(y(t/2)) ≤ 2, the step points k and l differ by at most 2,
which gives the following three possibilities.
– k = l. It holds 1 ≤ k ≤ t/2. The case k < t/2 is depicted in Fig. 7.a, where only balancers bk and b0 receive two
different input values. The case k = t/2 is depicted in Fig. 8.a, where all balancers receive the same input values. The
output sequence z(t) is step in both cases.
Fig. 9. The case a = b + 1 (panels a, b, c).
– k = l+ 1. It holds 2 ≤ k ≤ t/2 and 1 ≤ l ≤ t/2− 1. The case k < t/2 is depicted in Fig. 7.b, while the case k = t/2
is depicted in Fig. 8.b. In both cases, only balancer b0 receives two different input values. The output sequence z(t) is
step in both cases.
– k = l+ 2. It holds 3 ≤ k ≤ t/2 and 1 ≤ l ≤ t/2− 2. The case k < t/2 is depicted in Fig. 7.c while the case k = t/2 is
depicted in Fig. 8.c. In both cases, only balancers bk−1 and b0 receive two different input values. The output sequence
z(t) is step in both cases.
• a = b + 1. If t/2 > 2 and k > 2, then we would have
$$\sum(x^{(t/2)}) > 2 \cdot a + \left(\frac{t}{2} - 2\right)(a - 1) = \frac{t}{2} \cdot b + 2 \ge \sum(y^{(t/2)}) + 2,$$
which would give ∑(x(t/2)) − ∑(y(t/2)) > 2, a contradiction. Therefore, it has to be that either k = 1 or k = 2. If k = 1 and l < t/2 − 1 then we would have
$$\sum(x^{(t/2)}) = 1 \cdot a + \left(\frac{t}{2} - 1\right)(a - 1) = \frac{t}{2} \cdot b + 1 > \sum(y^{(t/2)}) + 2,$$
which would give ∑(x(t/2)) − ∑(y(t/2)) > 2, a contradiction. Thus, if k = 1, then it has to be l ≥ t/2 − 1. Similarly, we obtain that if k = 2 then it has to be l = t/2. Consequently, we only need to examine the following three possibilities.
– k = 1 and l = t/2 − 1. This case is depicted in Fig. 9.a. Note that û(t) is 2-smooth, and balancer b_0 receives input values a and a − 2. The resulting sequence z(t) contains a − 1 in all of its entries. Thus, z(t) is step.
– k = 1 and l = t/2. This case is depicted in Fig. 9.b. Here, û(t) is 1-smooth. Only balancer b_0 receives different input values. The output sequence z(t) is step.
– k = 2 and l = t/2. This case is depicted in Fig. 9.c. Note that û(t) is again 1-smooth. Only balancers b_0 and b_1 receive different input values. The output sequence z(t) is step.
So, in all cases the output sequence z(t) satisfies the step property, as needed. □
Next, we show the correctness of network M(t, δ) for any δ.
Lemma 3.3. M(t, δ) is a difference merging network.
Proof. Suppose that M(t, δ) is quiescent, and that each of the input sequences x(t/2) and y(t/2) satisfies the step property with 0 ≤ ∑(x(t/2)) − ∑(y(t/2)) ≤ δ. We will show that the output sequence z(t) satisfies the step property. We will prove the claim by induction on δ.
For the basis case δ = 2, Lemma 3.2 proves that M(t, 2) is a difference merging network.
Consider now the case δ > 2, and suppose that the network M(t′, δ/2) is a difference merging network, for any valid value of t′ (induction hypothesis). We will show that the network M(t, δ) is a difference merging network.
Since z(t) is the output sequence of M(t, 2), z(t) will be step if each of g(t/2) and h(t/2) is step and 0 ≤ ∑(g(t/2)) − ∑(h(t/2)) ≤ 2.
First we show that g(t/2) is step. By the induction hypothesis, M_0(t/2, δ/2) is a difference merging network. Thus, g(t/2) is step if each of x_e(t/4) and y_e(t/4) is step and 0 ≤ ∑(x_e(t/4)) − ∑(y_e(t/4)) ≤ δ/2. Since each of x(t/2) and y(t/2) is step, Lemma 2.1 implies that x_e(t/4) and y_e(t/4) are step too. Furthermore, since 0 ≤ ∑(x(t/2)) − ∑(y(t/2)) ≤ δ, Lemma 2.4 implies that 0 ≤ ∑(x_e(t/4)) − ∑(y_e(t/4)) ≤ δ/2. Hence, the sequence g(t/2) is step.
In an exactly similar way we can prove that the sequence h(t/2) is step.
Now, we show that 0 ≤ ∑(g(t/2)) − ∑(h(t/2)) ≤ 2. By the sum preservation property of networks M_0 and M_1 we have
$$\sum(g^{(t/2)}) = \sum(x_e^{(t/4)}) + \sum(y_e^{(t/4)}),$$
and
$$\sum(h^{(t/2)}) = \sum(x_o^{(t/4)}) + \sum(y_o^{(t/4)}).$$
From Lemma 2.3,
$$0 \le \sum(x_e^{(t/4)}) - \sum(x_o^{(t/4)}) \le 1,$$
and
$$0 \le \sum(y_e^{(t/4)}) - \sum(y_o^{(t/4)}) \le 1.$$
Adding these two inequalities we get
$$0 \le \left(\sum(x_e^{(t/4)}) + \sum(y_e^{(t/4)})\right) - \left(\sum(x_o^{(t/4)}) + \sum(y_o^{(t/4)})\right) \le 2,$$
which implies that
$$0 \le \sum(g^{(t/2)}) - \sum(h^{(t/2)}) \le 2.$$
Since each of g(t/2) and h(t/2) is step, and 0 ≤ ∑(g(t/2)) − ∑(h(t/2)) ≤ 2, the output sequence z(t) is also step, as needed. □
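As an illustration of the induction step, the following worked instance traces M(8, 4) on one input. The particular values are chosen here for illustration and do not come from the paper; the intermediate sequences follow from the construction of Section 3.1, assuming each (2, 2)-balancer sends the ceiling of half its tokens to its first output wire.

```latex
\begin{align*}
x^{(4)} &= (2,1,1,1), \quad y^{(4)} = (1,0,0,0), \quad
  \textstyle\sum(x^{(4)}) - \sum(y^{(4)}) = 5 - 1 = 4 = \delta, \\
x_e^{(2)} &= (2,1), \quad y_e^{(2)} = (1,0), \quad
  \textstyle\sum(x_e^{(2)}) - \sum(y_e^{(2)}) = 3 - 1 = 2 \le \delta/2, \\
x_o^{(2)} &= (1,1), \quad y_o^{(2)} = (0,0), \quad
  \textstyle\sum(x_o^{(2)}) - \sum(y_o^{(2)}) = 2 - 0 = 2 \le \delta/2, \\
g^{(4)} &= (1,1,1,1), \quad h^{(4)} = (1,1,0,0), \quad
  \textstyle\sum(g^{(4)}) - \sum(h^{(4)}) = 4 - 2 = 2, \\
z^{(8)} &= (1,1,1,1,1,1,0,0).
\end{align*}
```

Both sub-mergers receive step inputs whose sums differ by at most δ/2 = 2, their outputs g and h differ in sum by exactly 2, and the final layer M(8, 2) produces a step sequence with sum 6.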
Fig. 10. The recursive construction of counting network C(w, t).
3.3. Comparison to the bitonic merging network
The merging network used in the bitonic counting network [5] is constructed recursively in a similar way to our merging network M(t, δ). However, the two networks have the following significant differences:
• The depth of our merging network is lg δ, while the depth of the bitonic merging network is lg t.
• In the recursion basis of our construction we use network M(t′, 2), for some t′, which is a layer of balancers, while the basis of the recursion of the bitonic merging network uses a single balancer.
• In the bitonic merging network, the first merger M_0 takes as inputs the even and odd subsequences of x(t/2) and y(t/2), respectively, while the second merger M_1 takes as inputs the odd and even subsequences of x(t/2) and y(t/2), respectively. In this way the output sequences of M_0 and M_1 have differences in the range [−1, 1]. In contrast, in our construction the output difference of the two mergers is in the range [0, 2]. As a consequence, we use layer M(t, 2) to combine the two outputs, while the bitonic merging network uses a different output layer. This results in non-isomorphic counting networks.
4. Counting network
We present the construction of a counting network C(w, t), which has input width w = 2^k and output width t = p · w, where p, k ≥ 1 (such assignments to parameters w and t will be called valid).
4.1. Construction of C(w, t)
In the construction of C(w, t), we will use as a building block the ladder network L(w) (see Fig. 10). The network L(w) is a balancing network of input and output width w, that consists of a single layer of w/2 copies of the (2, 2)-balancer, denoted b_0, . . . , b_{w/2−1}. Consider a balancer b_i, where 0 ≤ i ≤ w/2 − 1. The top and bottom input wires of balancer b_i are connected to input wires i and i + w/2, respectively, of L(w) (these correspond to elements i and i + w/2 of the input sequence of L(w)). The top and bottom output wires of balancer b_i are connected to output wires i and i + w/2, respectively, of L(w) (these correspond to elements i and i + w/2 of the output sequence of L(w)).
The construction of C(w, t) is by recursion on w, where t takes arbitrary valid values.
• Recursive basis, w = 2. The network C(2, t) is just a (2, t)-balancer, for any valid t.
• Recursive step, w > 2. For the inductive case, suppose that we have constructed the network C(w/2, t′), for any valid t′. We will construct the network C(w, t), for any valid t, in two sub-steps as follows (see Fig. 10).
Fig. 11. On the left are the recursive constructions of networks C(4, 4) and C(4, 8), while on the right are the respective networks when the wires are
straightened.
– Sub-step 1. Let x(w) and y(t) denote the input and output sequences, respectively, of C(w, t). We take a copy of the ladder network L(w), and two copies of C(w/2, t/2), denoted as C_0(w/2, t/2) and C_1(w/2, t/2). Let e(w/2) and g(t/2) be the input and output sequences, respectively, of C_0(w/2, t/2); let f(w/2) and h(t/2) be the input and output sequences, respectively, of C_1(w/2, t/2). The input sequence x(w) is directly-connected to the input sequence of L(w). The input sequence e(w/2) is directly-connected to the first half of the output sequence of L(w), while input sequence f(w/2) is directly-connected to the second half of the output sequence of L(w).
– Sub-step 2. Take now a copy of the merging network M(t, w/2) (described in Section 3). The first input sequence of network M(t, w/2) is directly-connected to the output sequence g(t/2), while the second input sequence of M(t, w/2) is directly-connected to h(t/2). The output sequence of M(t, w/2) is directly-connected to the output sequence y(t) of network C(w, t).
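The two sub-steps can be sketched as a recursive computation on quiescent token counts. As before, this is a model under stated assumptions rather than an implementation of the network itself: balancers are represented only by their quiescent outputs, the merger follows the construction of Section 3.1, and all function names are hypothetical.

```python
from itertools import product


def balancer22(a, b):
    """Quiescent outputs (top, bottom) of a (2, 2)-balancer."""
    total = a + b
    return (total + 1) // 2, total // 2


def step_sequence(n, total):
    """The unique step sequence of length n with the given sum."""
    return [(total - i + n - 1) // n for i in range(n)]


def merge_basis(x, y):
    """M(t, 2) with t = 2 * len(x), wired as in Section 3.1."""
    half = len(x)
    z = [0] * (2 * half)
    z[0], z[-1] = balancer22(x[0], y[half - 1])
    for i in range(1, half):
        z[2 * i - 1], z[2 * i] = balancer22(y[i - 1], x[i])
    return z


def merge(x, y, delta):
    """M(t, delta) applied to first input x and second input y."""
    if delta == 2:
        return merge_basis(x, y)
    g = merge(x[0::2], y[0::2], delta // 2)
    h = merge(x[1::2], y[1::2], delta // 2)
    return merge_basis(g, h)


def ladder(x):
    """L(w): balancer b_i joins wires i and i + w/2."""
    half = len(x) // 2
    out = [0] * len(x)
    for i in range(half):
        out[i], out[i + half] = balancer22(x[i], x[i + half])
    return out


def counting_network(x, t):
    """C(w, t) with w = len(x); returns the quiescent output sequence."""
    w = len(x)
    if w == 2:
        return step_sequence(t, x[0] + x[1])  # a single (2, t)-balancer
    mixed = ladder(x)                                   # sub-step 1
    g = counting_network(mixed[: w // 2], t // 2)       # C_0 on e
    h = counting_network(mixed[w // 2:], t // 2)        # C_1 on f
    return merge(g, h, w // 2)                          # sub-step 2


if __name__ == "__main__":
    print(counting_network([3, 0, 5, 1], 8))
    # Exhaustive check of the step property for C(4, 4) on small inputs.
    ok = all(
        counting_network(list(x), 4) == step_sequence(4, sum(x))
        for x in product(range(4), repeat=4)
    )
    print(ok)
```

The first call models C(4, 8) with nine tokens in total; the second statement checks that C(4, 4) produces the step sequence for every input with at most three tokens per wire.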
Figs. 11–13 depict the recursive constructions of networks C(4, 4), C(4, 8), C(8, 8), and C(8, 16). Each example
demonstrates how a ladder network, two smaller recursive counting networks, and a merging network can be combined in
order to obtain a larger counting network. In each figure we also show the respective networks that can be obtained when
the wires that connect the balancers are straightened throughout the network; this is a typical way to draw the networks.
As can be observed in the figures, the wire straightening causes the input wire order to be permuted, while the output
wires remain the same. Note however, that the input wire permutation does not affect the correct operation of the counting
network since the output token distribution in any quiescent state is only affected by the total number of tokens and not
where the tokens have entered from.
Next, we calculate the depth of network C(w, t).
Theorem 4.1. depth(C(w, t)) = (lg² w + lg w)/2.
Proof. By the recursive construction of C(w, t), depth(C(2, t)) = 1, and
$$\begin{aligned}
\mathrm{depth}(C(w,t)) &= 1 + \mathrm{depth}\left(C\left(\tfrac{w}{2},\tfrac{t}{2}\right)\right) + \mathrm{depth}\left(M\left(t,\tfrac{w}{2}\right)\right) \\
&= 1 + \mathrm{depth}\left(C\left(\tfrac{w}{2},\tfrac{t}{2}\right)\right) + \lg\tfrac{w}{2} \qquad \text{(by Lemma 3.1)} \\
&= \cdots \\
&= k + \mathrm{depth}\left(C\left(\tfrac{w}{2^k},\tfrac{t}{2^k}\right)\right) + \sum_{i=1}^{k} \lg\frac{w}{2^i} \\
&= k + \mathrm{depth}\left(C\left(\tfrac{w}{2^k},\tfrac{t}{2^k}\right)\right) + k\lg w - \frac{k^2}{2} - \frac{k}{2} \\
&= \cdots \\
&= \lg w - 1 + \mathrm{depth}\left(C\left(\tfrac{w}{2^{\lg w - 1}},\tfrac{t}{2^{\lg w - 1}}\right)\right) + (\lg w - 1)\lg w - \frac{(\lg w - 1)^2}{2} - \frac{\lg w - 1}{2} \\
&= \frac{\lg^2 w + \lg w}{2} \qquad \left(\text{since } \mathrm{depth}\left(C\left(2,\tfrac{2t}{w}\right)\right) = \mathrm{depth}(C(2,2p)) = 1\right),
\end{aligned}$$
as needed. □
Fig. 12. On the top is the recursive construction of network C(8, 8), while on the bottom is the respective network when the wires are straightened; this network uses as a component the network C(4, 4) (Fig. 11) and merging network M(8, 4) (Fig. 6).
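The depth recurrence of Theorem 4.1 can be checked numerically against its closed form. This is a small sketch with hypothetical function names; it only evaluates the recurrences stated in Lemma 3.1 and Theorem 4.1 and does not build any network.

```python
def depth_merger(delta):
    """depth(M(t, delta)) from the recurrence of Lemma 3.1."""
    return 1 if delta == 2 else depth_merger(delta // 2) + 1


def depth_counting(w):
    """depth(C(w, t)): ladder layer + C(w/2, t/2) + M(t, w/2)."""
    if w == 2:
        return 1
    return 1 + depth_counting(w // 2) + depth_merger(w // 2)


def closed_form(w):
    """(lg^2 w + lg w) / 2 for w a power of two."""
    lg = w.bit_length() - 1
    return (lg * lg + lg) // 2


if __name__ == "__main__":
    for k in range(1, 6):
        w = 2 ** k
        print(w, depth_counting(w), closed_form(w))
```

Note that the depth does not depend on the output width t, which is why neither function takes t as a parameter.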
4.2. Correctness of C(w, t)
We now show the correctness of network C(w, t).
Theorem 4.2. C(w, t) is a counting network.
Proof. Suppose that C(w, t) is quiescent. We prove that for any values on the input sequence x(w), the output sequence y(t) is step. We will prove the correctness of C(w, t) by induction on w.
For the basis case w = 2, the network C(2, t) is just a (2, t)-balancer, which is a counting network by definition.
Fig. 13. On the top is the recursive construction of network C(8, 16), while on the bottom is the respective network when the wires are straightened; this network uses as a component the network C(4, 8) (Fig. 11) and merging network M(16, 4) (Fig. 6).
For w > 2, suppose that the network C(w/2, t ′) is counting for any valid t ′ (induction hypothesis). We will show that
the network C(w, t) is counting too, for any valid t .
Consider the construction of C(w, t). By the induction hypothesis, we have that the respective outputs g(t/2) and h(t/2)
of C0 and C1, are step. Since, by Lemma 3.3, M(t, w/2) is a difference merging network, the sequence y(t) is step if
0 ≤∑(g(t/2))−∑(h(t/2)) ≤ w/2. Next, we prove that this property holds.
By the sum preservation property of networks C_0 and C_1,
$$\sum(e^{(w/2)}) = \sum(g^{(t/2)}),$$
and
$$\sum(f^{(w/2)}) = \sum(h^{(t/2)}).$$
Thus, we only need to prove that 0 ≤ ∑(e(w/2)) − ∑(f(w/2)) ≤ w/2.
The sequences e(w/2) and f(w/2) are connected to the outputs of the (2, 2)-balancers b_0, . . . , b_{w/2−1} of the ladder network L(w), so that the first output wire of b_i is connected to e_i and the second to f_i, for all 0 ≤ i ≤ w/2 − 1. Since the outputs of balancer b_i have the step property for any input sequence x(w), 0 ≤ e_i − f_i ≤ 1, for all i, where 0 ≤ i < w/2. By summing these inequalities for all the w/2 balancers of L(w), we obtain
$$0 \le \left(e_0 + \cdots + e_{w/2-1}\right) - \left(f_0 + \cdots + f_{w/2-1}\right) \le \frac{w}{2},$$
which implies
$$0 \le \sum(e^{(w/2)}) - \sum(f^{(w/2)}) \le \frac{w}{2}.$$
Consequently, the sequence y(t) is step, as needed. □
5. Butterfly network
Here we describe the butterfly network. As we will see in Section 6, the first lg w layers of network C(w, t) are isomorphic to the butterfly. We give two isomorphic descriptions of the butterfly, the forward-butterfly and the backward-butterfly. It is easy to show that the first layers of C(w, t) are isomorphic to a backward-butterfly; however, the contention analysis is easier in the forward-butterfly.
5.1. Forward-butterfly
Here we describe the forward-butterfly, denoted D(w), which is a regular network of width w = 2^k, where k ≥ 0.
Network D(w) is constructed recursively on w (see top of Fig. 14).
• Recursive basis, w = 1. D(w) is simply a wire.
• Recursive step, w > 1. Suppose we have constructed D(w/2). D(w) is constructed by taking two copies of D(w/2), which we denote D_0(w/2) and D_1(w/2), and the ladder network L(w) (which was described in Section 4.1). The output sequence of D_0(w/2) is directly-connected to the first half of the input sequence of L(w), while the output sequence of D_1(w/2) is directly-connected to the second half of the input sequence of L(w). The input sequence of D(w) is the concatenation of the input sequences of D_0(w/2) and D_1(w/2), while the output sequence of D(w) is the output sequence of L(w).
From the recursive construction of the forward-butterfly network, we immediately have:
Lemma 5.1. depth(D(w)) = lgw.
We next show that the forward-butterfly is lg w-smoothing.
Lemma 5.2. D(w) is lg w-smoothing.
Proof. Suppose that D(w) is quiescent. The proof is by induction on w.
For the basis case, where w = 1, the network is a wire that trivially has the 0-smooth property.
Assume that the claim holds for w/2. We will show that the claim holds also for w.
Recall the recursive construction of D(w). By the induction hypothesis, each of D_0(w/2) and D_1(w/2) is lg(w/2)-smoothing. Let c_0 and d_0 be the minimum and maximum values, respectively, of the output sequence of D_0(w/2). Similarly, define c_1 and d_1 for D_1(w/2). From the lg(w/2)-smooth property we have d_0 − c_0 ≤ lg(w/2), and d_1 − c_1 ≤ lg(w/2). Thus, d_0 + d_1 ≤ c_0 + c_1 + 2 lg(w/2).
Each balancer b_i at L(w) receives one input from the outputs of D_0(w/2) and the other input from the outputs of D_1(w/2). An arbitrary balancer b_i will output the minimum value if it receives c_0 and c_1 on its inputs. In this case, the minimum value on the outputs of the balancer will appear on the bottom output wire and is given by c = ⌈(c_0 + c_1 − 1)/2⌉. In a similar way we can show that the maximum value on a balancer is d = ⌈(d_0 + d_1)/2⌉. Hence,
$$d = \left\lceil \frac{d_0 + d_1}{2} \right\rceil \le \left\lceil \frac{c_0 + c_1 + 2\lg\frac{w}{2}}{2} \right\rceil = \left\lceil \frac{c_0 + c_1}{2} \right\rceil + \lg\frac{w}{2}.$$
Fig. 14. Top: the forward-butterfly D(w), and instance D(8); Bottom: the backward-butterfly E(w), and instance E(8).
Clearly,
$$\left\lceil \frac{c_0 + c_1}{2} \right\rceil - \left\lceil \frac{c_0 + c_1 - 1}{2} \right\rceil \le 1,$$
which implies that
$$c \ge \left\lceil \frac{c_0 + c_1}{2} \right\rceil - 1.$$
Therefore,
$$d - c \le \left\lceil \frac{c_0 + c_1}{2} \right\rceil + \lg\frac{w}{2} - \left\lceil \frac{c_0 + c_1}{2} \right\rceil + 1 = \lg\frac{w}{2} + 1 = \lg w,$$
as needed. □
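The recursive construction of D(w) and the bound of Lemma 5.2 can be explored with a short quiescent-state sketch. It assumes, as elsewhere, that a (2, 2)-balancer sends the ceiling of half its tokens to its top output wire; the function names are hypothetical.

```python
from itertools import product


def balancer22(a, b):
    """Quiescent outputs (top, bottom) of a (2, 2)-balancer."""
    total = a + b
    return (total + 1) // 2, total // 2


def ladder(x):
    """L(w): balancer b_i joins wires i and i + w/2."""
    half = len(x) // 2
    out = [0] * len(x)
    for i in range(half):
        out[i], out[i + half] = balancer22(x[i], x[i + half])
    return out


def forward_butterfly(x):
    """D(w): two copies of D(w/2) followed by the ladder L(w)."""
    w = len(x)
    if w == 1:
        return list(x)  # a single wire
    first = forward_butterfly(x[: w // 2])   # D_0 on the first half
    second = forward_butterfly(x[w // 2:])   # D_1 on the second half
    return ladder(first + second)


if __name__ == "__main__":
    print(forward_butterfly([3, 0, 0, 0]))
    # Largest output spread of D(4) over all inputs with entries below 5.
    spread = max(
        max(out) - min(out)
        for out in (forward_butterfly(list(x))
                    for x in product(range(5), repeat=4))
    )
    print(spread)
```

By Lemma 5.2 the reported spread for D(4) cannot exceed lg 4 = 2, regardless of how unevenly the tokens are placed on the input wires.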
5.2. Backward-butterfly
Here, we describe the backward-butterfly network, whose construction is similar to the forward-butterfly network, with
the only difference that the ladder network appears in the front.
A backward-butterfly network E(w) is a regular network of width w, where w = 2^k and k ≥ 0. The network is constructed recursively on w (see bottom of Fig. 14).
• Recursion basis, w = 1. E(w) is simply a wire.
• Recursion step, w > 1. Assume that we have constructed E(w/2). The network E(w) is constructed by taking two copies of E(w/2), which we denote E_0(w/2) and E_1(w/2), and the ladder network L(w). The input sequence of E(w) is the input sequence of L(w). The input sequence of E_0(w/2) is directly-connected to the first half of the output sequence of L(w), while the input sequence of E_1(w/2) is directly-connected to the second half of the output sequence of L(w). The first half and second half of the output sequence of E(w) are directly-connected to the output sequences of E_0(w/2) and E_1(w/2), respectively.
Next, we show that the backward-butterfly and the forward-butterfly are isomorphic.
Lemma 5.3. The backward-butterfly E(w) is isomorphic to the forward-butterfly D(w).
Proof. We will prove the claim by induction on w. For the induction basis, we consider the cases w = 1 and w = 2. For w = 1, E(w) is simply a wire which is trivially a forward-butterfly. For w = 2, E(w) is a (2, 2)-balancer, which is trivially a forward-butterfly.
Suppose now that the claim holds for all w′ such that 2 ≤ w′ < w. We will show that the claim holds for w, that is, E(w) and D(w) are isomorphic.
We will prove the claim by transforming E(w) to D(w). The transformation consists of several steps with intermediate isomorphic networks where the last step is D(w). The transformation is depicted in Fig. 15.
By construction, E(w) consists of two identical backward-butterflies E_0(w/2) and E_1(w/2), and the ladder network L(w) (Fig. 15.a). By the induction hypothesis, E_0(w/2) is isomorphic to forward-butterfly D_0(w/2). As illustrated in Fig. 15.b, in the construction of E(w), we replace E_0 with D_0, by the addition of an appropriate input permutation π_in and output permutation π_out. Similarly E_1(w/2) is substituted with D_1(w/2).
Input wire i of D_0 corresponds to input wire j = π_in^R(i) of E_0. Similarly, input wire i of D_1 corresponds to input wire j = π_in^R(i) of E_1. Let b_j be the balancer of L(w) that connects the jth inputs of E_0 and E_1. We have that b_j connects the ith input wires of D_0 and D_1. Suppose that L(w) consists of balancers b_0, . . . , b_{w/2−1}. Let L′(w) be a new ladder network with balancers b′_0, . . . , b′_{w/2−1}, which is isomorphic to L(w) so that balancer b_j corresponds to balancer b′_i, where i = π_in(j). Replace now the input permutation and the network L(w) with network L′(w), and remove also the output permutations π_out (Fig. 15.c). The new network is isomorphic to E(w).
Since w/2 > 1, from the recursive construction of forward-butterflies, we have that D_0(w/2) consists of two copies of D(w/4), denoted D′_0(w/4) and D′_1(w/4), whose outputs are connected with a copy of the ladder network L(w/2), denoted G_0(w/2) (Fig. 15.d). Similarly, D_1(w/2) consists of two forward-butterflies, D′_2(w/4) and D′_3(w/4), and a ladder network G_1(w/2).
Next, we exchange the positions of D′_2 and D′_3, as shown in Fig. 15.e. This exchange results in the transformation of ladder network L′(w) to two canonical layers L′′_0(w/2) and L′′_1(w/2), where L′′_0 connects the inputs of D′_0(w/4) and D′_2(w/4), while layer L′′_1 connects the inputs of D′_1(w/4) and D′_3(w/4). Further, the exchange of D′_2(w/4) and D′_3(w/4) results in the combination of G_0(w/2) and G_1(w/2) to a new ladder network of width w that we denote as G′(w). The whole new network is isomorphic to E(w).
By the induction hypothesis, D′_0(w/4), . . . , D′_3(w/4) are isomorphic to some backward-butterflies, say E′_0(w/4), . . . , E′_3(w/4), respectively. We replace each D′_i with the respective E′_i, using appropriate input and output permutations (Fig. 15.f). After we remove the input and output permutations in a similar way as described in one of the previous paragraphs, we obtain the network of Fig. 15.g, where input ladder networks L′′_0 and L′′_1 are translated to L′′′_0 and L′′′_1, respectively, and the output ladder network G′ is translated to G′′.
The combination of L′′′_0, E′_0, and E′_2 forms a backward-butterfly that we denote as E′′_0(w/2) (Fig. 15.g). Similarly, the combination of L′′′_1, E′_1, and E′_3 forms a backward-butterfly E′′_1(w/2). By the induction hypothesis, E′′_0(w/2) and E′′_1(w/2) are isomorphic to some forward-butterflies, say D′′_0(w/2) and D′′_1(w/2), respectively. We replace E′′_0 with D′′_0 and E′′_1 with D′′_1, respectively, with appropriate input and output permutations (Fig. 15.h). After we remove the input and output permutations, we obtain a new output ladder network G′′′(w) (Fig. 15.i). The resulting network is a forward-butterfly D(w) of width w which is isomorphic to E(w). □
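One consequence of Lemma 5.3, combined with Lemmas 2.8 and 5.2, is that the backward-butterfly E(w) is also lg w-smoothing. The sketch below models E(w) on quiescent token counts and checks that consequence on small inputs. It does not construct the isomorphism itself, it assumes a (2, 2)-balancer sends the ceiling of half its tokens to its top output wire, and the function names are hypothetical.

```python
from itertools import product


def balancer22(a, b):
    """Quiescent outputs (top, bottom) of a (2, 2)-balancer."""
    total = a + b
    return (total + 1) // 2, total // 2


def ladder(x):
    """L(w): balancer b_i joins wires i and i + w/2."""
    half = len(x) // 2
    out = [0] * len(x)
    for i in range(half):
        out[i], out[i + half] = balancer22(x[i], x[i + half])
    return out


def backward_butterfly(x):
    """E(w): the ladder L(w) followed by two copies of E(w/2)."""
    w = len(x)
    if w == 1:
        return list(x)  # a single wire
    mixed = ladder(x)
    first = backward_butterfly(mixed[: w // 2])   # E_0 on the first half
    second = backward_butterfly(mixed[w // 2:])   # E_1 on the second half
    return first + second


if __name__ == "__main__":
    print(backward_butterfly([3, 0, 0, 0]))
    # Check lg w-smoothness of E(4) on all inputs with entries below 5.
    print(all(
        max(out) - min(out) <= 2
        for out in (backward_butterfly(list(x))
                    for x in product(range(5), repeat=4))
    ))
```

The only structural difference from the forward-butterfly is the position of the ladder layer, which comes first here rather than last.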
6. Contention analysis
Here, we give the contention analysis of our counting network C(w, t). We first give some necessary preliminaries and
a methodology to analyze the contention of a layer. We then continue to analyze the contention of the butterfly network
which will help to compute the contention of C(w, t).
Fig. 15. A series of transformations to prove that backward-butterflies are isomorphic to forward-butterflies.
6.1. Preliminaries on contention
As we discussed in Section 1.2, the contention in a balancing network B is measured through stalls of tokens. Given m tokens issued by n concurrent processes there are many possible executions involving the tokens. The contention measure cont(B, n, m) denotes the maximum total number of stalls, over all possible executions, induced by an adversary scheduler. The amortized contention cont(B, n) (the limit supremum of cont(B, n, m) divided by m, as m goes to infinity) expresses the worst number of stalls experienced by an average token in any execution. A useful property on contention that we will use is that if we increase the concurrency n the contention does not decrease, since in an execution the additional processes may always choose to be inactive. Thus,
Observation 6.1. For any n′ > n, cont(B, n′,m) ≥ cont(B, n,m), and cont(B, n′) ≥ cont(B, n).
In the analysis of the contention of a network B it is sometimes convenient to examine the structure of B and estimate the contention of its components separately. However, we cannot simply measure the contention of each component in isolation and then sum the results. The reason is that when we measure the contention of B we assume that the processes are distributed uniformly on the input wires. This assumption may not hold for a "suffix" subnetwork B′ of B which does not include any balancers from the first layer of B. The balancers of the first and subsequent layers of B may distribute the processes among the inputs of B′ in a nonuniform way. Thus, an independent analysis of the contention in B′ is meaningless when we estimate the contention of the whole of B. (Note that this problem does not occur in "prefix" subnetworks, whose first layer consists of balancers from the first layer of B, since the processes are distributed uniformly on the inputs.)
To overcome this difficulty we analyze the contention of the constituents of B in executions of B. Let ℓ be a layer of B. We use the notion of layer contention where we estimate the contention of ℓ in executions on B, rather than treating ℓ as a separate network. The layer contention, denoted cont(B, ℓ, n, m), is the maximum total number of stalls experienced in layer ℓ by m tokens issued by n processes over all possible executions in B, induced by an adversary scheduler. The amortized contention cont(B, ℓ, n) is the limit supremum of cont(B, ℓ, n, m) divided by m, as m goes to infinity. It is easy to see that the contention of B does not exceed the sum of the contentions of its layers, since the total number of stalls experienced by a token is the sum of the stalls experienced in each layer of the network. Thus, we have:
Observation 6.2. If B consists of layers $\ell_1, \ldots, \ell_m$, then $\mathrm{cont}(B, n) \le \sum_{i=1}^{m} \mathrm{cont}(B, \ell_i, n)$.
The definition of layer contention can be extended to any subnetwork of B. For example, if B consists of the cascade of
two networks B1 and B2, then cont(B, n) ≤ cont(B,B1, n) + cont(B,B2, n). (Note that if B1 is a prefix of B, then
cont(B,B1, n) = cont(B1, n).) In a similar way we can also define the balancer contention cont(B, b, n) for any balancer
b.
Consider now the balancers of the last layer of B. Let b be a balancer in the last layer and suppose that b has output width q. Suppose that we change the output width of b to q′ ≥ 2 and let B′ be the respective network. This change does not affect the contention of b, since b handles the tokens in both cases in the same way by implementing atomic Fetch&Increment type of operations for the tokens that traverse it, which causes serialization stalls that are not affected by the output width of the balancer. In other words, every execution in the original network B has a corresponding execution with exactly the same contention in B′ and vice-versa. Thus, we have:
Observation 6.3. Let b be a balancer in the last layer of network B and suppose that b has output width q. Suppose that we change the output width of b to q′ ≥ 2 and let B′ be the respective network. Then, cont(B, b, n) = cont(B′, b, n).
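The serialization argument behind Observation 6.3 can be illustrated with a toy model. The sketch below assumes that all tokens are already waiting at a single balancer and that the scheduler lets them pass one at a time; the function `simulate_balancer` and its parameters are illustrative names, not part of the construction. Each passing token stalls every token still waiting, so in this model the total number of stalls depends only on how many tokens wait, not on the output width.

```python
from collections import deque

def simulate_balancer(num_waiting: int, output_width: int):
    """Serve tokens waiting at one balancer, one at a time (toy model).

    Every time a token passes, each token still waiting incurs one stall.
    Returns (total_stalls, tokens_per_output_wire).
    """
    waiting = deque(range(num_waiting))
    outputs = [0] * output_width
    state = 0            # index of the next output wire (round-robin)
    total_stalls = 0
    while waiting:
        waiting.popleft()              # one token passes atomically
        total_stalls += len(waiting)   # the remaining tokens are stalled once
        outputs[state] += 1
        state = (state + 1) % output_width
    return total_stalls, outputs

# Same number of waiting tokens, different output widths.
stalls_narrow, out_narrow = simulate_balancer(6, 2)
stalls_wide, out_wide = simulate_balancer(6, 6)
print(stalls_narrow, out_narrow)
print(stalls_wide, out_wide)
```

With six waiting tokens the total is 5 + 4 + 3 + 2 + 1 + 0 = 15 stalls for either output width; only the distribution of tokens over the output wires changes.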
6.2. Methodology for layer contention analysis
We derive a general formula for computing the amortized contention of any layer of a balancing network. Parts of the
following discussion are adapted from [12, Section 3.2].
Let ℓ be a layer of balancing network B. Assume throughout this subsection that layer ℓ is made of balancers of output width at most q. Denote by w the output width of ℓ. Assume also throughout this subsection that in any quiescent state of B, the output of ℓ has the k-smooth property. (In [12], it is assumed that ℓ is 1-smooth.) Recall that the concurrency of B is n. Dwork et al. [12] have introduced the following methodology for analyzing and bounding the amortized contention of ℓ:
1. Partition the set of tokens that traverse ℓ into groups of size w, called generations.
2. Estimate the total number of stalls that occur in the tokens, by examining the average stalls that occur between tokens of different generations.
3. To obtain the amortized contention, divide the result by the size of the generations, w.
We will show that as a group, each generation of tokens at layer ℓ causes O(qn + qkw) stalls to other tokens. It then follows that, on average, a generation receives O(qn + qkw) stalls, since in the amortized analysis the stalls are distributed uniformly among the tokens. Dividing by the number of tokens in a generation, it follows that the average token passing through ℓ receives (or causes) O(qn/w + qk) stalls.
We now continue with the details of the formal proof. Let b be a balancer of ℓ with output width r, where 2 ≤ r ≤ q. We say that a token belongs to the gth generation of tokens arriving at b if it is one of the ((g − 1)r + 1)th, ..., ((g − 1)r + r)th tokens to arrive at b. Note that each generation of b has r tokens. The gth generation of layer ℓ is the set of gth generation tokens of the balancers at layer ℓ. We say that by time t, the gth generation has completed its arrival at ℓ if for each balancer in ℓ, all the tokens of the gth generation have already entered the layer by that time. We say that at time t there are f tokens of the gth generation missing at layer ℓ if by time t exactly w − f tokens of generation g have arrived at ℓ. We say that a token is stuck in ℓ, if the token has entered a balancer in ℓ but not exited.
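The generation bookkeeping above reduces to integer arithmetic. As a minimal sketch (the helper names `generation_at_balancer` and `missing_tokens` are hypothetical and introduced only for illustration), the generation of the i-th token arriving at a balancer of output width r, and the number f of tokens of a generation still missing from a layer of width w, can be computed as:

```python
def generation_at_balancer(arrival_index: int, r: int) -> int:
    """Generation of the arrival_index-th token (1-based) at a balancer
    of output width r: tokens (g-1)r+1, ..., (g-1)r+r form generation g."""
    if arrival_index < 1:
        raise ValueError("arrival_index is 1-based")
    if r < 2:
        raise ValueError("output width r must be at least 2")
    return (arrival_index - 1) // r + 1

def missing_tokens(w: int, arrived: int) -> int:
    """Number f of tokens of a generation missing at a layer of output
    width w when exactly `arrived` = w - f of its tokens have arrived."""
    if not 0 <= arrived <= w:
        raise ValueError("arrived must lie between 0 and w")
    return w - arrived

# Arrival order at a balancer of output width r = 4.
print([generation_at_balancer(i, 4) for i in range(1, 10)])
```

For r = 4 the first four arrivals belong to generation 1, the next four to generation 2, and the ninth arrival opens generation 3.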
The following two lemmas are adaptations of [12, Fact 1] and [12, Fact 2] for the case where ℓ has the k-smooth property (instead of the 1-smooth property which is considered in [12]).
Lemma 6.1. Suppose that layer ℓ is in a quiescent state. Let g be the maximum generation (up to the quiescent state) such that some balancer b in ℓ has received at least one token of generation g. Then all balancers in ℓ have received at least one token of generation g − k.
Proof. Since balancer b has received a generation g token, the maximum value in the output sequence of b is at least g. Assume, by contradiction, that there is a balancer b′ that has received no generation g − k token. The maximum value on the output sequence of b′ is at most g − k − 1. So, there is an output wire of b and an output wire of b′ with difference at least g − (g − k − 1) = k + 1. This is a contradiction, since the output sequence of ℓ is k-smooth. □
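The contradiction in this proof rests on the k-smooth property, taken here in its usual sense: any two values of the output sequence differ by at most k. A minimal check, assuming the output of a layer is given as a list of per-wire token counts (the helper `is_k_smooth` is a hypothetical name used only for this illustration), might look as follows.

```python
def is_k_smooth(outputs, k: int) -> bool:
    """True if any two entries of `outputs` differ by at most k."""
    if not outputs:
        return True
    return max(outputs) - min(outputs) <= k

# A wire carrying at least g tokens next to a wire carrying at most
# g - k - 1 tokens differs by at least k + 1, which violates k-smoothness.
g, k = 7, 2
print(is_k_smooth([g, g - 1, g - k], k))      # difference k
print(is_k_smooth([g, g - 1, g - k - 1], k))  # difference k + 1
```

The first sequence stays within a spread of k, while the second exhibits exactly the gap of k + 1 that the proof rules out.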
Lemma 6.2. Let t be the time at which the first token of generation g arrives at ℓ. Then the number of tokens of generations strictly less than g − k stuck at ℓ plus the number of tokens of generations strictly less than g − k still missing from layer ℓ is at most n.
Proof. Run the network B to quiescence from its state at time t. Let g′ be the maximum generation (up to the quiescent state) such that some balancer in layer ℓ has received at least one generation g′ token. Clearly g′ ≥ g. By Lemma 6.1, every balancer has received at least one token from generation g′ − k ≥ g − k. Thus, the claim follows since at most n tokens (the maximum number of tokens in B at any time) were involved in moving B to a quiescent state. □
Recall that when a token passes through a balancer it causes stalls to all tokens that are waiting at the balancer. By stalls caused at layer ℓ by generation g to generation g′, where g′ ≥ g, we refer to stalls incurred to tokens of generation g′ when they are waiting at some balancer of layer ℓ and some token of generation g passes. The following result is an adaptation of [12, Lemma 3.2.4] for the case where ℓ has the k-smooth property (instead of the 1-smooth property that is considered in [12]).
Lemma 6.3. Consider generation g passing through layer ℓ. The maximum number of stalls caused to generation g by generations less than or equal to g at this layer is at most qn + q(k + 1)w.
Proof. Suppose that the first token of generation g arrives at ℓ at time t. The generation g token can appear at the same time in a balancer of ℓ with tokens of the following two types:
(1) tokens of generation strictly less than g − k,
(2) or tokens of generation g − k, ..., g.
From Lemma 6.2, the total number of tokens of generation strictly less than g − k stuck at ℓ or missing from ℓ is at most n. Therefore, the type (1) tokens are at most n. The tokens of type (2) are at most (k + 1)w, since each generation has w tokens. Each token of generation less than or equal to g encounters at most q tokens of generation g (since q is the maximum balancer width in ℓ). Therefore, the tokens of generation g experience at most qn stalls from tokens of type (1), and q(k + 1)w stalls from tokens of type (2), for a total of qn + q(k + 1)w stalls. □
Since there are w tokens in any generation g, Lemma 6.3 implies that the amortized contention endured by a token at layer ℓ is at most (qn + q(k + 1)w)/w. Hence:
Corollary 6.4. The amortized contention of layer ℓ is:
\[
\mathrm{cont}(B, \ell, n) \le \frac{qn}{w} + q(k+1).
\]
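The step from Lemma 6.3 to Corollary 6.4 is a single division by the generation size w. The sketch below evaluates both bounds with exact rational arithmetic; the function names are hypothetical and the parameter values are chosen only as an example.

```python
from fractions import Fraction

def generation_stall_bound(q: int, n: int, k: int, w: int) -> int:
    """Lemma 6.3: stalls caused to one generation, at most qn + q(k+1)w."""
    return q * n + q * (k + 1) * w

def layer_contention_bound(q: int, n: int, k: int, w: int) -> Fraction:
    """Corollary 6.4: amortized bound per token, qn/w + q(k+1)."""
    return Fraction(generation_stall_bound(q, n, k, w), w)

# Example: (2,2)-balancers (q = 2), n = 64 processes,
# a 1-smooth layer (k = 1) of output width w = 16.
print(generation_stall_bound(2, 64, 1, 16))
print(layer_contention_bound(2, 64, 1, 16))
```

For these values the generation bound is 2·64 + 2·2·16 = 192 stalls, which amortizes to 192/16 = 12 stalls per token, matching qn/w + q(k + 1) = 8 + 4.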
6.3. Contention of butterfly
We now compute the contention of the forward-butterfly network, which will help to compute the contention of our
counting network.
Lemma 6.5. The amortized contention of the forward-butterfly network D(w) is
\[
\mathrm{cont}(D(w), n) < \frac{4n \lg w}{w} + \lg^2 w + \lg w.
\]
Proof. We have that w is a power of 2. Let n′ be the smallest power of 2 which is greater than or equal to n. Note that n′ < 2n. Assume now that the concurrency is n′. We examine the cases n′ ≥ w and n′ < w.
First, consider the case n′ ≥ w. From the construction of the network D(w) (Section 5.2) we have that a token first traverses either the subnetwork D0(w/2) or the subnetwork D1(w/2) and then the layer L(w). The concurrency of each of D0 and D1 is at most n′/2, since the distributed system assigns each process to a particular input wire so that the processes are uniformly distributed on the input wires. The same argument applies recursively, since there are no balancers in between the inputs of D(w) and the inputs of D0 and D1. The concurrency of layer L(w) is n′, since all processes traverse this layer. Therefore, the amortized contention incurred by any token in D(w) is:
\[
\mathrm{cont}(D(w), n') \le \mathrm{cont}\left(D\left(\frac{w}{2}\right), \frac{n'}{2}\right) + \mathrm{cont}(D(w), L(w), n').
\]
By Lemma 5.2, the output sequence of layer L(w) has the lg w-smooth property. Since layer L(w) is made of (2, 2)-balancers, and the width of the layer is w, from Corollary 6.4 we obtain:
\[
\mathrm{cont}(D(w), L(w), n') \le \frac{2n'}{w} + 2(\lg w + 1) = 2\left(\frac{n'}{w} + \lg w + 1\right).
\]
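For the case where w = 1, the network D(1) is just a wire which trivially has amortized contention equal to zero. Denote k = lg w. By applying Lemma 5.2 repeatedly we obtain:
\[
\begin{aligned}
\mathrm{cont}(D(w), n')
&\le \mathrm{cont}\left(D\left(\frac{w}{2}\right), \frac{n'}{2}\right) + \mathrm{cont}(D(w), L(w), n') \\
&\le \mathrm{cont}\left(D\left(\frac{w}{2}\right), \frac{n'}{2}\right) + 2\left(\frac{n'}{w} + k + 1\right) \\
&\le \mathrm{cont}\left(D\left(\frac{w}{2^2}\right), \frac{n'}{2^2}\right) + \mathrm{cont}\left(D\left(\frac{w}{2}\right), L\left(\frac{w}{2}\right), \frac{n'}{2}\right) + 2\left(\frac{n'}{w} + k + 1\right) \\
&\le \mathrm{cont}\left(D\left(\frac{w}{2^2}\right), \frac{n'}{2^2}\right) + 2\left(\frac{n'/2}{w/2} + (k-1) + 1\right) + 2\left(\frac{n'}{w} + k + 1\right) \\
&= \mathrm{cont}\left(D\left(\frac{w}{2^2}\right), \frac{n'}{2^2}\right) + 2\left(2\,\frac{n'}{w} + 2k - 1 + 2\right) \\
&\;\;\vdots \\
&\le \mathrm{cont}\left(D\left(\frac{w}{2^j}\right), \frac{n'}{2^j}\right) + 2\left(j\,\frac{n'}{w} + jk - \sum_{i=1}^{j-1} i + j\right) \\
&= \mathrm{cont}\left(D\left(\frac{w}{2^j}\right), \frac{n'}{2^j}\right) + 2\left(j\,\frac{n'}{w} + jk - \frac{j^2 + j}{2}\right) \\
&\;\;\vdots \\
&\le \mathrm{cont}\left(D\left(\frac{w}{2^k}\right), \frac{n'}{2^k}\right) + 2\left(k\,\frac{n'}{w} + k^2 - \frac{k^2 + k}{2}\right) \\
&= 0 + 2\left(k\,\frac{n'}{w} + k^2 - \frac{k^2 + k}{2}\right) \quad \left(\text{since } \mathrm{cont}\left(D(1), \frac{n'}{w}\right) = 0\right) \\
&= \frac{2kn'}{w} + k^2 - k \\
&= \frac{2n' \lg w}{w} + \lg^2 w - \lg w.
\end{aligned}
\]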
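If n′ < w then we can consider the concurrency n′′ = w, and from Observation 6.1 we have that
\[
\begin{aligned}
\mathrm{cont}(D(w), n') &\le \mathrm{cont}(D(w), n'') \\
&\le \frac{2n'' \lg w}{w} + \lg^2 w - \lg w \\
&= 2\lg w + \lg^2 w - \lg w \\
&= \lg^2 w + \lg w.
\end{aligned}
\]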
When we combine the results of both cases n′ ≥ w and n′ < w we obtain:
\[
\mathrm{cont}(D(w), n') \le \frac{2n' \lg w}{w} + \lg^2 w + \lg w.
\]
Therefore, from Observation 6.1, and the fact that n′ < 2n, we obtain:
\[
\begin{aligned}
\mathrm{cont}(D(w), n) &\le \mathrm{cont}(D(w), n') \\
&\le \frac{2n' \lg w}{w} + \lg^2 w + \lg w \\
&< \frac{4n \lg w}{w} + \lg^2 w + \lg w.
\end{aligned}
\]
□
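Two ingredients of this proof are easy to evaluate numerically: the rounding of the concurrency n up to a power of two n′ with n′ < 2n, and the right-hand side of the bound stated in Lemma 6.5. The following sketch does only that; it does not simulate the network, and the function names are hypothetical.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of 2 that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()

def lg_exact(w: int) -> int:
    """lg w for w a power of 2."""
    if w < 1 or w & (w - 1):
        raise ValueError("w must be a power of 2")
    return w.bit_length() - 1

def butterfly_contention_bound(n: int, w: int) -> float:
    """Right-hand side of Lemma 6.5: 4n lg w / w + lg^2 w + lg w."""
    k = lg_exact(w)
    return 4 * n * k / w + k * k + k

# The rounding step never doubles the concurrency.
print(all(next_power_of_two(n) < 2 * n for n in range(1, 1025)))
# Bound for n = 64 processes on a butterfly of width w = 16.
print(butterfly_contention_bound(64, 16))
```

For n = 64 and w = 16 the stated bound evaluates to 4·64·4/16 + 16 + 4 = 84 stalls per token.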
6.4. Contention of counting network
Here, we compute the amortized contention of the counting network C(w, t) (described in Section 4.1). Recall from Section 1.3.2 that the unfolded construction of the network C(w, t) consists of blocks Na, Nb and Nc (Fig. 3). Let Na,b denote the cascade of Na and Nb. We will show that a simple variation of Na,b is isomorphic to a forward-butterfly. This will help to compute the contention of Na,b and Nc, which will give the contention of C(w, t).
Fig. 16. Left: the network C′(w, t), which consists of the first lg w layers of C(w, t); Right: the network C′′(w), in which the last layer of C′(w, t) is replaced with (2, 2)-balancers.
Let C′(w, t) be the network C(w, t) without the difference-merging subnetworks in the recursive construction of C(w, t), as shown in the left part of Fig. 16. The input width of C′(w, t) is w while its output width is t. The construction of C′(w, t) resembles the recursive construction of a backward-butterfly E(w). The only difference is that at the basis of the recursive construction, C′(2, 2t/w) is a (2, 2p)-balancer (recall that t = p · w, where p ≥ 1), instead of a (2, 2)-balancer in E(w). Thus, all layers of C′(w, t), except for the last, consist of (2, 2)-balancers. Clearly, the depth of C′(w, t) is lg w.
The network C′(w, t) describes how the balancers of the first lg w layers of C(w, t) are connected. Since Na,b consists of the first lg w layers of C(w, t), we have that Na,b is exactly the same as C′(w, t). We obtain the following result.
Lemma 6.6. Na,b is s-smoothing, where
\[
s = \left\lfloor \frac{w \lg w}{t} \right\rfloor + 2.
\]
Proof. Let C′′(w) denote the network that we obtain if we replace each (2, 2p)-balancer in the last layer of C′(w, t) with a (2, 2)-balancer. The recursive construction of C′′(w) is depicted in the right part of Fig. 16. Clearly, C′′(w) is a backward-butterfly. From Lemma 5.3, the backward-butterfly is isomorphic to the forward-butterfly. Therefore, from Lemmas 2.8 and 5.2, C′′(w) is lg w-smoothing.
Let $b_0, \ldots, b_{w/2-1}$ be the (2, 2)-balancers in the last layer of C′′(w). Let $x_i^{(2)}$ denote the output sequence of balancer $b_i$. Since C′′(w) is lg w-smoothing, $\left|\sum(x_i^{(2)}) - \sum(x_j^{(2)})\right| \le 2 \lg w$, for any indices i and j such that 0 ≤ i, j < w/2. The factor 2 in the term 2 lg w comes from the fact that each (2, 2)-balancer has two output wires.
Now, restore the (2, 2p)-balancers to the last layer of C′′(w), in order to obtain network C′(w, t). Denote by $\hat{b}_0, \ldots, \hat{b}_{w/2-1}$ these balancers and by $\hat{x}_0^{(2p)}, \ldots, \hat{x}_{w/2-1}^{(2p)}$ their respective output sequences. Since for any balancer $\hat{b}_i$ the only difference from $b_i$ is the number of output wires, the total sum of tokens that leave any balancer in both cases is the same. That is, $\sum(x_i^{(2)}) = \sum(\hat{x}_i^{(2p)})$. Hence, $\left|\sum(\hat{x}_i^{(2p)}) - \sum(\hat{x}_j^{(2p)})\right| \le 2 \lg w$ for any two balancers $\hat{b}_i$ and $\hat{b}_j$.
From Lemma 2.2, the maximum values on any two output wires, one wire from balancer $\hat{b}_i$ and the other from balancer $\hat{b}_j$, will differ by at most $\lfloor 2 \lg w/(2p) \rfloor + 1 = \lfloor \lg w/p \rfloor + 1$. So, the maximum difference between any two output wires is $\lfloor \lg w/p \rfloor + 2 = \lfloor w \lg w/t \rfloor + 2 = s$, since p = t/w. Consequently, C′(w, t) is s-smoothing.
Since Na,b is the same as C′(w, t), Lemma 2.8 implies that Na,b is s-smoothing. □
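The smoothing parameter s of Lemma 6.6 depends only on w and t. As a small numerical illustration (the function name `smoothing_parameter` is hypothetical, and the sketch assumes, as in the construction, that w is a power of 2 and t = p · w), the following code evaluates s and checks that the two expressions for the floor term used in the proof agree.

```python
def smoothing_parameter(w: int, t: int) -> int:
    """s = floor(w lg w / t) + 2 for w a power of 2 and t a multiple of w."""
    if w < 2 or w & (w - 1):
        raise ValueError("w must be a power of 2, at least 2")
    if t < w or t % w != 0:
        raise ValueError("t must be a positive multiple of w")
    lg_w = w.bit_length() - 1
    return (w * lg_w) // t + 2

# For w = 16 (lg w = 4), s shrinks as the output width t = p * w grows.
for p in (1, 2, 4, 8):
    w = 16
    t = p * w
    lg_w = 4
    # floor(w lg w / t) equals floor(lg w / p) because t = p * w.
    assert (w * lg_w) // t == lg_w // p
    print(t, smoothing_parameter(w, t))
```

Once t exceeds w lg w the floor term vanishes and s reaches its minimum value of 2, which is the regime that gives the lowest contention in the block Nc.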
We are now ready to prove an upper bound on the amortized contention of the network C(w, t).
Theorem 6.7. The amortized contention of network C(w, t) is
\[
\mathrm{cont}(C(w, t), n) < \frac{4n \lg w}{w} + \frac{n \lg^2 w}{t} + \frac{w \lg^3 w}{t} + 4 \lg^2 w + \lg w.
\]
Proof. Since network C(w, t) is the cascade of networks Na,b and Nc, and since Na,b is a prefix of C(w, t), we have:
\[
\begin{aligned}
\mathrm{cont}(C(w, t), n) &\le \mathrm{cont}(C(w, t), N_{a,b}, n) + \mathrm{cont}(C(w, t), N_c, n) \\
&= \mathrm{cont}(N_{a,b}, n) + \mathrm{cont}(C(w, t), N_c, n).
\end{aligned}
\]
Recall from the proof of Lemma 6.6 that the network C′′(w) is isomorphic to the forward-butterfly network D(w), which implies that their amortized contention is the same, since any execution in one network has a corresponding execution in the other network. Thus, from Lemma 6.5,
\[
\mathrm{cont}(C''(w), n) < \frac{4n \lg w}{w} + \lg^2 w + \lg w.
\]
From Observation 6.3, this contention remains the same even when we restore the original (2, 2p)-balancers, namely,
cont(C ′(w, t), n) = cont(C ′′(w), n). Since Na,b and C ′(w, t) are the same, we have cont(Na,b, n) = cont(C ′(w, t), n) =
cont(C ′′(w), n).
Now, consider block Nc. The concurrency for every layer of Nc is n. From Lemma 6.6, Na,b is s-smoothing. Lemma 2.5 implies that the output of each layer of Nc will be s-smooth. Since Nc is regular with width t consisting of (2, 2)-balancers, Corollary 6.4 implies that each layer of Nc has amortized contention at most 2n/t + 2(s + 1). From Observation 6.2, the total amortized contention of block Nc is bounded by the contention of a layer multiplied by the number of layers in Nc. From Theorem 4.1,
\[
\mathrm{depth}(N_c) = \mathrm{depth}(C(w, t)) - \lg w = \frac{\lg^2 w - \lg w}{2}.
\]
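Therefore,
\[
\begin{aligned}
\mathrm{cont}(C(w, t), N_c, n) &\le \left(\frac{2n}{t} + 2(s + 1)\right) \frac{\lg^2 w - \lg w}{2} \\
&\le \frac{n \lg^2 w}{t} + s \lg^2 w + \lg^2 w \\
&\le \frac{n \lg^2 w}{t} + \frac{w \lg^3 w}{t} + 3 \lg^2 w.
\end{aligned}
\]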
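By adding the amortized contentions for blocks Na,b and Nc we obtain:
\[
\begin{aligned}
\mathrm{cont}(C(w, t), n) &\le \mathrm{cont}(N_{a,b}, n) + \mathrm{cont}(C(w, t), N_c, n) \\
&< \frac{4n \lg w}{w} + \lg^2 w + \lg w + \frac{n \lg^2 w}{t} + \frac{w \lg^3 w}{t} + 3 \lg^2 w \\
&= \frac{4n \lg w}{w} + \frac{n \lg^2 w}{t} + \frac{w \lg^3 w}{t} + 4 \lg^2 w + \lg w,
\end{aligned}
\]
as needed. □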
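To see how the output width t affects the bound of Theorem 6.7, the right-hand side can simply be evaluated for a fixed w and n and several values of t. The sketch below does this under the construction's assumptions (w a power of 2, t a multiple of w); the function name is hypothetical and the numbers are bound values, not measured contention.

```python
def counting_network_contention_bound(n: int, w: int, t: int) -> float:
    """Right-hand side of Theorem 6.7:
    4n lg w / w + n lg^2 w / t + w lg^3 w / t + 4 lg^2 w + lg w."""
    if w < 2 or w & (w - 1):
        raise ValueError("w must be a power of 2, at least 2")
    if t < w or t % w != 0:
        raise ValueError("t must be a positive multiple of w")
    k = w.bit_length() - 1  # k = lg w
    return 4 * n * k / w + n * k**2 / t + w * k**3 / t + 4 * k**2 + k

# n = 1024 processes, w = 16 input wires (w lg w = 64).
n, w = 1024, 16
print(counting_network_contention_bound(n, w, 16))   # regular case t = w
print(counting_network_contention_bound(n, w, 128))  # t > w lg w
```

For t = w = 16 the terms are 1024 + 1024 + 64 + 64 + 4 = 2180, while for t = 128 they are 1024 + 128 + 8 + 64 + 4 = 1228: the two terms that carry t in the denominator shrink and the term 4n lg w / w dominates, which is the O(n lg w / w) regime obtained for n > w lg w and t > w lg w.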
7. Discussion
We presented a counting network construction with w input wires and t output wires, where w = 2^k and t = p · w, for any p, k ≥ 1. This is one of very few known irregular counting network constructions [3,26], whose output width may be different from its input width. We showed that the irregularity can benefit the amortized contention, by bringing it to lower levels than other networks.
As a byproduct of our analysis, we obtain a novel sorting network construction. It is known that from any regular balancing
network, we can obtain a comparator network if we substitute each balancer by a comparator [5]. If the original balancing
network is a counting network, then the corresponding comparator network is a sorting network [5] (for more information
on sorting networks see [23]). Our counting network C(w, w), where the input width is the same as the output width, gives a novel sorting network with depth O(lg² w).
Several interesting questions remain. Is it possible to extend our construction to arbitrary input and output widths, other
than multiples of a power of two? It follows from impossibility results in [1,10] that appropriate sets of balancer types
would have to be used for such extension. Using such larger balancers is often expected to cause a reduction in depth (see
[7,9,14,15]). What would be a trade-off between depth and contention in this situation? Can the combinatorial techniques
in [10] be used to show impossibility results on constructible widths for difference merging networks? We believe that
the difference merging network we presented is of independent interest and could be used for other counting and sorting
network constructions.
It would be interesting to examine if randomization can help to improve the depth of our network. In a particular kind of randomization, a randomized balancer chooses a random initial state. Using randomization in the first layers we may obtain a smaller bound on δ, the output difference of the recursively constructed counting networks. This may give networks of smaller depth. Randomization in counting networks has been studied in [3,17,24]. Another interesting extension would be to
examine if our counting network could have an adaptive implementation in a distributed system, similar to the adaptive
bitonic network given in [27]. In the adaptive scenario, the size of the network, or parts of the network, adapt to the load
of the system in order to decrease the contention of the counting network. It would also be interesting to examine the
self-stabilizing properties of our network [18].
Acknowledgements
We are indebted to the anonymous referee for providing very useful comments and corrections that helped to improve
the paper significantly. We are also indebted to Maurice Herlihy for providing invaluable comments on an earlier version of
this paper. The second author was supported by funds for the promotion of research at University of Cyprus.
References
[1] E. Aharonson, H. Attiya, Counting networks with arbitrary fan-out, Distributed Computing 8 (1995) 163–169.
[2] W. Aiello, C. Busch, M. Herlihy, M. Mavronicolas, N. Shavit, D. Touitou, Supporting increment and decrement operations in balancing networks, Chicago Journal of Theoretical Computer Science 2000 (December) 2000. Article 4. Electronic journal: http://cjtcs.cs.uchicago.edu/articles/2000/4/cj00-04.pdf.
[3] W. Aiello, R. Venkatesan, M. Yung, Coins, weights and contention in balancing networks, in: Proceedings of the 13th Annual ACM Symposium on
Principles of Distributed Computing, August 1994, pp. 193–205.
[4] M. Ajtai, J. Komlós, E. Szemerédi, An O(n log n) sorting network, Combinatorica 3 (1983) 1–19.
[5] J. Aspnes, M. Herlihy, N. Shavit, Counting networks, Journal of the ACM 41 (5) (1994) 1020–1048.
[6] K.E. Batcher, Sorting networks and their applications, in: Proceedings of AFIPS Joint Computer Conference, vol. 32, April 1968, pp. 307–314.
[7] C. Busch, N. Hardavellas, M. Mavronicolas, Contention in counting networks, in: Proceedings of the 13th Annual ACM Symposium on Principles of
Distributed Computing, August 1994, p. 404.
[8] C. Busch, M. Herlihy, A survey on counting networks, in: Proceedings of Workshop on Distributed Data and Structures, March/April 1998, pp. 13–20.
[9] C. Busch, M. Herlihy, Sorting and counting networks of arbitrary width and small depth, Theory of Computing Systems 35 (2) (2002) 99–128.
[10] C. Busch, M. Mavronicolas, A combinatorial treatment of balancing networks, Journal of the ACM 43 (5) (1996) 749–839.
[11] C. Busch, M. Mavronicolas, P. G. Spirakis, The cost of concurrent, low-contention read&modify&write, Theoretical Computer Science 333 (3) (2005)
373–400.
[12] C. Dwork, M. Herlihy, O. Waarts, Contention in shared memory algorithms, Journal of the ACM 44 (6) (1997) 779–805.
[13] P. Fatourou, M. Herlihy, Read-modify-write networks, Distributed Computing 17 (1) (2004) 33–46.
[14] E.W. Felten, A. LaMarca, R. Ladner, Building counting networks from larger balancers, Technical Report 93-04-09, Department of Computer Science
and Engineering, University of Washington, April 1993.
[15] N. Hardavellas, D. Karakos, M. Mavronicolas, Notes on sorting and counting networks, in: A. Schiper (Ed.), Proceedings of the 7th International
Workshop on Distributed Algorithms, in: Lecture Notes in Computer Science, vol. 725, Springer-Verlag, Lausanne, Switzerland, September 1993,
pp. 234–248.
[16] M. Herlihy, N. Shavit, O. Waarts, Linearizable counting networks, Distributed Computing 9 (4) (1996) 193–203.
[17] M. Herlihy, S. Tirthapura, Randomized smoothing networks, Journal of Parallel and Distributed Computing 66 (5) (2006) 626–632.
[18] M. Herlihy, S. Tirthapura, Self stabilizing smoothing and balancing networks, Distributed Computing 18 (5) (2006) 345–357.
[19] E.N. Klein, A generic simulation of counting networks, Master’s Thesis, Computer Science Department, Rensselaer Polytechnic Institute, July 2003.
[20] E.N. Klein, C. Busch, D.R. Musser, An experimental analysis of counting networks, Technical Report 06-13, Department of Computer Science, Rensselaer Polytechnic Institute, 2006.
[21] M. Klugerman, C. G. Plaxton, Small-depth counting networks, in: STOC ’92: Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of
Computing, 1992, pp. 417–428.
[22] M. R. Klugerman, Small-depth counting networks and related topics, Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1995.
[23] D. Knuth, Sorting and Searching, in: The Art of Computer Programming, vol. 3, Addison-Wesley, 1973.
[24] M. Mavronicolas, T. Sauerwald, The impact of randomization in smoothing networks, in: Proceedings of the Twenty-Seventh ACM Symposium on
Principles of Distributed Computing, PODC’08, New York, NY, 2008, pp. 345–354.
[25] N. Shavit, E. Upfal, A. Zemach, A steady state analysis of diffracting trees, Theory of Computing Systems 31 (4) (1998) 403–423.
[26] N. Shavit, A. Zemach, Diffracting trees, ACM Transactions on Computer Systems 14 (4) (1996) 385–428.
[27] S. Tirthapura, Adaptive counting networks, in: Proceedings of the International Conference on Distributed Computing Systems, ICDCS, 2005.
