An inherent bottleneck in distributed counting
Report, ETH Zurich Research Collection
Author(s): Wattenhofer, Roger; Widmayer, Peter
Publication Date: 1997
Permanent Link: https://doi.org/10.3929/ethz-a-006651923
Rights / License: In Copyright - Non-Commercial Use Permitted
An Inherent Bottleneck in Distributed Counting
Roger Wattenhofer, Peter Widmayer
Institut für Theoretische Informatik
ETH Zürich, Switzerland
Abstract
A distributed counter allows each processor in an asynchronous message passing network
to access the counter value and increment it. We study the problem of implementing a
distributed counter such that no processor is a communication bottleneck. We prove a lower
bound of $\Omega(k)$ on the number of messages that some processor must exchange in a sequence
of n counting operations spread over n processors, where $k \cdot k^k = n$. We propose a counter
that achieves this bound for the situation it is derived for, namely when each processor
increments the counter exactly once. Hence, the lower bound is tight. Because most algorithms and
data structures count in some way, the lower bound holds for many distributed computations.
We feel that the proposed concept of a communication bottleneck is a relevant measure of
efficiency for a distributed algorithm and data structure, because it indicates the achievable
degree of distribution.
1 Introduction
Counting is an essential ingredient in virtually any computation. It is therefore highly desirable
to implement counters efficiently. In a distributed setting, efficiency has a number of important
constituents, such as time or message complexity. Various precise measures of efficiency for
these constituents have been established in the literature; for instance, the time complexity of
a distributed algorithm in an asynchronous setting measures the worst case time from the start
of a run to its completion, based on the assumption that each message takes only one time unit.
One important aspect of efficiency, however, is mostly taken into account only on an intuitive
basis in the construction of distributed algorithms and data structures: The work of the
algorithm should not be concentrated at any single processor or within a small group of
processors, even if this optimizes some measure of efficiency. For instance, even though a
data structure implementing a distributed counter could be message optimal by just storing
the counter value with a single processor and having all other processors access the counter with
only one message exchange, such an implementation is clearly unreasonable: This solution does
not scale; whenever a large number of processors operate on the counter, the single processor
handling the counter value will be a bottleneck.
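To make this bottleneck concrete, the following minimal sketch counts the messages each processor handles when the counter value is stored at a single processor. It is an illustration only: the function name `central_counter_loads` and the choice of processor 1 as the holder are assumptions made for this example, not part of the paper.

```python
from collections import Counter

def central_counter_loads(n, holder=1):
    """Message load per processor when one processor stores the counter.

    Each processor 1..n performs one inc; a non-holder sends one request
    to the holder and receives one reply (one message exchange).
    """
    load = Counter()
    for p in range(1, n + 1):
        if p == holder:
            continue  # the holder increments locally, no messages needed
        load[p] += 2        # p sends the request and receives the reply
        load[holder] += 2   # holder receives the request and sends the reply
    return load

loads = central_counter_loads(8)
print(max(loads.values()))  # load of the holder: 2 * (n - 1) = 14
```

The holder's load grows linearly in n, while every other processor handles only two messages.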
This paper studies the bottleneck that is inherent in any counting mechanism in a distributed
asynchronous setting. We characterize this bottleneck by deriving a lower bound on the number
of messages that some processor must handle in a sequence of operations. Even in the simple
case where test-and-increment is the only supported operation, the lower bound holds. It carries
over directly to any data structure in which the effect of an operation critically depends on the
preceding operation(s), that is, virtually any other data structure. We also propose a distributed
counter with an optimum bottleneck for the specific counting problem that we use in our lower
bound proof. This shows that the lower bound is tight.
Related Work
To the best of our knowledge, this is the first study of the inherent bottleneck in a distributed
data structure. Therefore, in the literature there are no close relatives to this paper. Two tracks
of research in distributed computation, however, relate loosely to our work, namely distributed
counters and quorum systems.
Efficient implementations of a distributed counter received considerable attention in the past few
years. The Combining Trees proposed in [YTL] and in [GVW] were the first to explicitly
aim at avoiding a bottleneck. Based on bitonic sorting networks, [AHS, HLS] developed
Counting Networks; they have been analyzed in [DHW, HSW]. Diffracting Trees by [SZ]
enjoy the benefits of Combining Trees and Counting Networks; they are analyzed and tested in
[SUZ].
Some of the reasoning in our paper is closely related to that in quorum systems. A
quorum system is a collection of sets of elements, where every two sets in the collection
intersect. The foundations of quorum systems have been laid in [GB] and [Mae]; [Nei]
contains a good survey on the topic. A dozen ways have been proposed of how to construct
quorum systems, starting in the early seventies [Lov, EL] and continuing until today
[HMP, PW, KM, PW]. Even though the approach we present in this paper might
be called a Dynamic Quorum System, it has nothing in common with [JM].
 Distributed Counting The Model
Consider an asynchronous distributed system of n processors in a message passing network
where each processor is uniquely identied with one of the integers from  to n Each processor
has unbounded local memory there is no shared memory Any processor can exchange messages
directly with any other processor There is no a priori bound on the length of a message A
message arrives at its destination an unbounded but nite amount of time after it has been
sent No failures whatsoever occur in the system
An abstract data type distributed counter is to be implemented for such a distributed system
A distributed counter encapsulates an integer value val and supports the operation inc for any
processor inc returns the current counter value val to the requesting processor and increments
the counter 
by one For the sake of deriving a lower bound we will ignore concurrency
control problems that is we will not propose a mechanism that synchronizes messages and
local operations at processors Let us therefore assume that enough time elapses in between any
two inc requests to make sure that the preceding inc operation is nished before the next one
starts
An inc operation (request) initiates a process, i.e., a partially ordered set of events in the
distributed system. Let us examine the process of a single inc operation. Let p be the processor
that initiates the inc operation. To do so, p sends a message to several other processors; these
in turn send messages to others, and so on. After a finite amount of time, p receives the last of
the messages which lets p determine the current value val of the counter (for instance, p may
simply receive the counter value in a message). This does not necessarily terminate the inc
process: Additional messages may be sent in order to prepare for future operations. As soon as
no further messages are sent, the inc process terminates. During this process, the counter value
has been incremented.
Figure 1: A processor initiates an inc operation (DAG with node labels 11, 17, 7, 3, 11, 17, 27).
We can visualize the process of an inc operation as a directed acyclic graph (DAG). A node
with label q of the DAG represents processor q performing some communication. In general, a
processor may appear more than once as a node label; in particular, the initiating processor p
appears as the source of the DAG and somewhere else in the DAG, where p is informed of the
current counter value val. An arc from a node labelled $p_1$ to a node labelled $p_2$ denotes the
sending of a message from processor $p_1$ to processor $p_2$. For initiating processor p, let $I_p$
denote the set of all processors that send or receive a message during the observed inc process. The
following lemma appears in similar form in many papers on quorum systems [Mae].
Hot Spot Lemma: Let p and q be two processors that increment the counter in direct
succession. Then $I_p \cap I_q \neq \emptyset$ must hold.
Proof: For the sake of contradiction, let us assume that $I_p \cap I_q = \emptyset$. Because only the
processors in $I_p$ can know the current value val of the counter after p's inc process, none of the
processors in $I_q$ knows about the inc operation initiated by p. Therefore, q gets an incorrect
counter value, a contradiction. □
Note that the argument in the Hot Spot Lemma can be made for the family of all distributed
data structures in which an operation depends on the operation that immediately precedes it.
Examples for such data structures are a bit that can be accessed and flipped, and a priority
queue. In this paper, however, we restrict our attention to the distributed counter.
The process of an inc operation is not a function of the requesting processor and the state of
the system, due to the nondeterministic nature of a distributed computation. Here, the state
in between any two operations is defined as a vector of the local states of the processors. Now
consider a prefix of the DAG of a process proc in state s of the system, and consider the set pref
of processors present in that prefix. Then, for any state of the system that is different from s but
is identical when restricted to the processors in pref, the considered prefix of proc is a prefix of
a possible process. The reason is simply that the processors in pref cannot distinguish between
both states, and therefore any partially ordered set of events that involves only processors in
pref and can happen nondeterministically in one state can happen in the other as well.
3 A Lower Bound on the Message Load
Definitions: Consider a sequence of consecutive inc operations of a distributed counter. Let $m_p$
denote the number of messages that processor p sends or receives during the operation sequence;
we call this the message load of processor p. Choose a processor b with
$m_b = \max_{p = 1, \ldots, n} m_p$ and call b a bottleneck processor.
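The definitions translate directly into a short computation. The sketch below assumes a log of (sender, receiver) pairs, which is a representation chosen for this example, and computes the message load of each processor and one bottleneck processor.

```python
from collections import Counter

def message_loads(messages):
    """Count, per processor, how many messages it sent or received."""
    load = Counter()
    for sender, receiver in messages:
        load[sender] += 1
        load[receiver] += 1
    return load

def bottleneck(load):
    """Return a processor b with maximum message load m_b."""
    return max(load, key=load.get)

# Hypothetical log of (sender, receiver) pairs.
log = [(1, 3), (3, 2), (2, 1), (4, 3), (3, 4)]
load = message_loads(log)
b = bottleneck(load)
print(b, load[b])  # 3 4
```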
We will derive a lower bound on the load for the interesting case in which not too many
operations are initiated by any single processor. One can easily show that the amount of
achievable distribution is limited if many operations are initiated by a single processor. To
be even more strict for the lower bound, we request that each processor initiates exactly one inc
operation.
Lower Bound Theorem: In any algorithm that implements a distributed counter on n
processors, there is a bottleneck processor that sends and receives $\Omega(k)$ messages, where
$k \cdot k^k = n$.
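To get a feeling for the size of k, the following sketch computes, for a given n, the largest integer k with $k \cdot k^k \le n$. The helper name `bottleneck_parameter` is our own. Since $(k+1) \ln k = \ln n$ when $k \cdot k^k = n$, the parameter k grows roughly like log n / log log n.

```python
def bottleneck_parameter(n):
    """Largest integer k >= 1 with k * k**k <= n."""
    k = 1
    while (k + 1) * (k + 1) ** (k + 1) <= n:
        k += 1
    return k

for n in (8, 81, 1024, 10**6):
    print(n, bottleneck_parameter(n))
# 8 2
# 81 3
# 1024 4
# 1000000 6
```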
Proof To simplify the argument let us replace the communication DAG of an inc process by
a topologically sorted linear list of the nodes of the DAG This communication list models the
DAG so that each message along an arc in the DAG corresponds to a sequence of messages
along a path in the list By counting each arc in the list just once we get a lower bound on the
number of messages per processor in the DAG because no processor has more incoming arcs to
nodes with its label in the list than in the DAG
Figure 2: Example of Figure 1 as a list (node labels 11, 17, 7, 3, 11, 17, 27).
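The replacement of a communication DAG by a list can be sketched with a standard topological sort. The DAG below is a hypothetical example, not the one from the figures; node ids are arbitrary, and each node carries a processor label, so the same processor may label several nodes.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical inc process: node ids are arbitrary, labels are processor numbers.
labels = {"a": 11, "b": 17, "c": 7, "d": 3, "e": 11}
arcs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

def communication_list(labels, arcs):
    """Return the processor labels of the DAG nodes in a topological order."""
    sorter = TopologicalSorter({node: set() for node in labels})
    for src, dst in arcs:
        sorter.add(dst, src)  # dst depends on src, so src comes first
    return [labels[node] for node in sorter.static_order()]

lst = communication_list(labels, arcs)
print(lst)           # starts and ends with 11; the order of 17 and 7 may vary
print(len(lst) - 1)  # list length measured in arcs: 4
```

The initiating processor (here 11) is the first element of the list, which is the property the proof relies on.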
Now let us study a particular sequence of n inc operations, where each processor initiates one
operation. For each operation in the sequence, there may be more than one possible process. We
will argue on possible prefixes of processes for each of the operations. The sequence of operations
is defined as follows: For each operation in the sequence, we choose a processor (among those
that have not been chosen yet) and a process such that the processor's communication list is
longest, where the length is measured as the number of arcs in the list. Let processor i denote
the processor that is chosen for the ith operation in the sequence, and let $L_i$ be the length of
the chosen communication list of processor i. Thus, the number of messages that are sent for
the ith operation is exactly $L_i$ in the list. For the total of n inc operations, the number of
messages sent is $\sum_{i=1}^{n} L_i$; let us denote this number as $n \cdot L$, with L as the average
number of messages sent. Because every sent message is going to be received by a processor, we have
$\sum_{p=1}^{n} m_p = 2nL$. This guarantees the existence of a processor b with
$m_b \geq \lceil 2nL / n \rceil \geq 2L$.
Figure 3: Situation before initiating an inc operation.
We will now argue on the choice of a processor for the ith inc operation. This choice is made
according to the lengths of the communication lists of processors that have not incremented yet
(see Figure 3). We will compare the list lengths of the chosen processor and the processor that
is chosen only for the very last inc operation. Let q be this last processor in the sequence, and
consider the list of processor q for the ith inc operation. Let $l_i$ denote the length of this list.
By definition of the operation sequence, $l_i \leq L_i$ ($i = 1, \ldots, n$). Let $p_{i,j}$ denote the
processor label of the jth node of the list, where $j = 1, \ldots, l_i$. By definition of the
communication list, we have $p_{i,1} = q$ for $i = 1, \ldots, n$.
Now dene the weight w
i
for the ith inc operation as
w
i

l
i
X
j
m
p
ij


j

where m
p
ij
 is the number of messages that processor p
ij
sent or received before the ith inc
operation and   m
b
  Initially we have m
p   for each p       n and therefore
w
 
 
How do the weights of the two consecutive lists of processor q differ when an inc operation is
performed? To see this, let us compare the weights $w_i$ and $w_{i+1}$. The Hot Spot Lemma tells us
that at least one of the processors in q's list must receive a message in the ith inc operation;
let $p_{i,f}$ be the first node with that property in the list. The list for inc operation i+1 can differ
from the list for inc operation i in all elements that follow $p_{i,f}$, but there
is at least one process in which it is identical to the list before inc operation i in all elements
up to $p_{i,f}$ (formally, $p_{i+1,j} = p_{i,j}$ for $j = 1, \ldots, f$). The reason is that for none of the
processors preceding $p_{i,f}$, the (knowledge about the) system state changes due to the ith inc
operation.
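This immediately gives
$$
\begin{aligned}
w_{i+1} &\geq w_i + \frac{1}{\mu^f} + \sum_{j=f+1}^{l_{i+1}} \frac{m_{p_{i+1,j}}(i+1)}{\mu^j} - \sum_{j=f+1}^{l_i} \frac{m_{p_{i,j}}(i)}{\mu^j} \\
&\geq w_i + \frac{1}{\mu^f} - \sum_{j=f+1}^{l_i} \frac{m_{p_{i,j}}(i)}{\mu^j} \\
&\geq w_i + \frac{1}{\mu^f} - \sum_{j=f+1}^{l_i} \frac{\mu - 1}{\mu^j} \\
&= w_i + \frac{1}{\mu^f} - \left( \frac{1}{\mu^f} - \frac{1}{\mu^{l_i}} \right) \\
&= w_i + \frac{1}{\mu^{l_i}} .
\end{aligned}
$$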
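The closed form of the sum in this estimate is the finite geometric series. As a worked sketch, with $\mu = m_b + 1 > 1$, list length $l_i$ and index $f \leq l_i$, and using that no processor has a load above $m_b = \mu - 1$:

```latex
\sum_{j=f+1}^{l_i} \frac{\mu-1}{\mu^{j}}
  = (\mu-1)\,\frac{\mu^{-(f+1)} - \mu^{-(l_i+1)}}{1-\mu^{-1}}
  = \mu \left( \mu^{-(f+1)} - \mu^{-(l_i+1)} \right)
  = \frac{1}{\mu^{f}} - \frac{1}{\mu^{l_i}} .
```

Subtracting this from the gain $1/\mu^f$ at position f leaves exactly $1/\mu^{l_i}$.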
We therefore have
$$w_n \geq \sum_{i=1}^{n-1} \frac{1}{\mu^{l_i}} .$$
Processor q sent and received at least $m_{p_{n,1}}(n)$ messages in the sequence of n inc operations. We
have
$$m_{p_{n,1}}(n) = \mu \left( w_n - \sum_{j=2}^{l_n} \frac{m_{p_{n,j}}(n)}{\mu^j} \right) .$$
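With $\mu - 1 = m_b \geq m_q \geq m_{p_{n,1}}(n)$ we get
$$
\begin{aligned}
\mu - 1 &\geq \mu \left( w_n - \sum_{j=2}^{l_n} \frac{m_{p_{n,j}}(n)}{\mu^j} \right) \\
&\geq \mu \left( w_n - \sum_{j=2}^{l_n} \frac{\mu - 1}{\mu^j} \right) \\
&= \mu \left( w_n - \frac{1}{\mu} + \frac{1}{\mu^{l_n}} \right) \\
&= \mu \left( w_n + \frac{1}{\mu^{l_n}} \right) - 1 \\
&\geq \mu \sum_{i=1}^{n} \frac{1}{\mu^{l_i}} - 1 \\
&\geq \mu \, n \sqrt[n]{\prod_{i=1}^{n} \frac{1}{\mu^{l_i}}} - 1 \\
&= \mu \, n \sqrt[n]{\frac{1}{\mu^{\sum_{i=1}^{n} l_i}}} - 1 \\
&\geq \mu \, n \sqrt[n]{\frac{1}{\mu^{\sum_{i=1}^{n} L_i}}} - 1 \\
&= \mu \, n \sqrt[n]{\frac{1}{\mu^{nL}}} - 1 = \frac{\mu \, n}{\mu^{L}} - 1 .
\end{aligned}
$$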
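That is, $\mu \geq \sqrt[L]{n}$, or equivalently $\mu^L \geq n$. With $\mu > m_b \geq L$ we obtain
$\mu^\mu \geq \mu^L \geq n = k \cdot k^k \geq k^k$, and we conclude $\mu \geq k$, where $k \cdot k^k = n$. Since
$m_b = \mu - 1$, this proves the claimed lower bound. □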
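The step from the sum to the product in the proof is the inequality between the arithmetic and the geometric mean. The following numeric sketch evaluates both sides for made-up list lengths; the function name `am_gm_sides` and the sample values are assumptions for illustration only.

```python
def am_gm_sides(mu, lengths):
    """Return (sum of mu**-l over all lengths, n * mu**-(mean of lengths))."""
    n = len(lengths)
    lhs = sum(mu ** -l for l in lengths)
    rhs = n * mu ** -(sum(lengths) / n)
    return lhs, rhs

lhs, rhs = am_gm_sides(5, [1, 2, 3, 6])
print(lhs >= rhs)  # True
```

Both sides coincide when all list lengths are equal, which is the case in which the estimate is tight.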
4 A Matching Upper Bound
We propose a distributed counter that achieves the lower bound of the previous section in the
worst case. It is based on a communication tree whose root holds the counter value. The leaves
of the tree are the processors that request inc operations. The inner nodes of the tree serve the
purpose of forwarding an inc operation request to the root. Recall that each of the n processors
requests exactly one inc operation.
The communication tree structure is as follows. Each inner node in the communication tree has
k children. All leaves of the tree are on level k+1; the root is on level zero. Hence the number
of leaves is $k \cdot k^k$. For simplicity, let us assume that $n = k \cdot k^k$ (otherwise, simply increase n
to the next higher value of the form $k \cdot k^k$, for integer k).
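As a sketch of this sizing rule, the helper below picks the smallest k whose tree has at least n leaves and lists the number of nodes per level. The function names `tree_parameter` and `level_sizes` are our own.

```python
def tree_parameter(n):
    """Smallest integer k >= 1 with k * k**k >= n, i.e. enough leaves for n processors."""
    k = 1
    while k * k**k < n:
        k += 1
    return k

def level_sizes(k):
    """Number of nodes on levels 0 (root) through k + 1 (leaves)."""
    return [k**level for level in range(k + 2)]

k = tree_parameter(50)
print(k, level_sizes(k))  # 3 [1, 3, 9, 27, 81]
```

For n = 50, the next value of the form $k \cdot k^k$ is 81, so the tree is built for k = 3.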
Each inner node in the tree stores k+3 values. It has an identifier id that tells which processor
currently works for the node; let us call this the current processor of the node. For simplicity,
we do not distinguish between a node and its current processor whenever no ambiguity arises.
Furthermore, it knows the identifiers of its k children and its parent. In addition, it keeps track
of the number of messages that the node sent or received since its current processor works for
it; we call this its age.
Initially node j 
j       k
i
  on level i 
i       k gets the identier

i k
k
 jk
k i
 

Figure 4: Communication Tree Structure (root on level 0, followed by levels 1, 2, ..., k; the children of a node are numbered 1, 2, ..., k).
Numbered this way, no two inner nodes on levels 1 through k get the same identifier.
Furthermore, the largest identifier (used for the parent of the rightmost leaf) has the value
$$(k-1) \, k^k + (k^k - 1) \, k^{k-k} + 1 = k \cdot k^k - k^k + k^k - k^0 + 1 = k \cdot k^k = n .$$
We will make sure that no two inner nodes on levels 1 through k ever have the same identifiers.
The root, nevertheless, starts with id = 1. The leaves have identifiers 1, ..., n from left to
right on level k+1, representing the n processors. Since all ids are defined by this regular
scheme, all the processors can compute all initial identifiers locally. The age of all inner nodes
including the root is initially 0. The root stores an additional value, the counter value val, where
initially val = 0.
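The numbering scheme can be checked mechanically. The sketch below assumes the identifier formula $(i-1) k^k + j k^{k-i} + 1$ with j counted from 0, and verifies for a small k that the initial identifiers on levels 1 through k are distinct and that the largest one equals n. The function names are our own.

```python
def initial_id(i, j, k):
    """Initial identifier of node j (0-based) on level i, for 1 <= i <= k."""
    return (i - 1) * k**k + j * k ** (k - i) + 1

def initial_ids(k):
    """Map (level, index) -> initial identifier for all inner nodes below the root."""
    return {
        (i, j): initial_id(i, j, k)
        for i in range(1, k + 1)
        for j in range(k**i)
    }

k = 3
ids = initial_ids(k)
print(len(set(ids.values())) == len(ids))  # True: all identifiers are distinct
print(max(ids.values()))                   # 81, which equals n = k * k**k
```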
The inc Operation
Now let us describe how an inc operation initiated at processor p is carried out. The leaf whose
id is p sends a message "inc from p" to its parent. Any non-root node receiving an "inc from
p" message forwards this message to its parent and increments its age by two (one for receiving
and one for sending a message). When the root receives an "inc from p" message, it sends a
message "val" to processor p and then increments val; furthermore, it increments its age by
two. After incrementing its age value, a node decides locally whether it should retire. It will
retire if and only if it has age ≥ 4k. To retire, the node updates its local values by setting
age := 0 and id_new := id_old + 1; it then sends 2k+2 final messages: k+1 messages inform the
new processor of its new job and of the ids of its parent and children nodes; the other k+1
messages inform the node's parent and children about id_new. Note that in this way, we are able
to keep the length of messages as short as O(log n) bits. There is a slight difference when the
root retires: It additionally informs the new processor of the counter value val, and it saves the
message that would inform the parent. Since the parent and children nodes receive a message,
they increment their age values. It may of course happen that this increment triggers the
retirement of parent and children nodes. If so, they again inform their parent, their children, and
the new processor as described. For simplicity, we do not describe here the details of handling
the corresponding messages; one way of solving this problem is a proper handshaking protocol
with a constant number of extra messages for each of the messages we describe.
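The following is a minimal sequential simulation sketch of the inc operation with retirement, not the authors' implementation. It rests on several assumptions: the retirement threshold is taken to be age ≥ 4k and the initial identifiers are $(i-1) k^k + j k^{k-i} + 1$; retirements are processed only after the request has reached the root; the handshaking messages are ignored; and the root's hand-over is charged k+1 messages like any other node. The names `Node`, `build_tree` and `simulate` are our own.

```python
from collections import Counter

class Node:
    def __init__(self, pid, parent):
        self.pid = pid        # id of the current processor of this node
        self.parent = parent  # parent Node, or None for the root
        self.children = []    # child Nodes; on level k, leaf processor ids
        self.age = 0

def build_tree(k):
    """Return (root, nodes on level k), using the assumed initial identifiers."""
    root = Node(1, None)
    level = [root]
    for i in range(1, k + 1):
        nxt = []
        for j in range(k**i):
            node = Node((i - 1) * k**k + j * k ** (k - i) + 1, level[j // k])
            node.parent.children.append(node)
            nxt.append(node)
        level = nxt
    for leaf in range(1, k ** (k + 1) + 1):
        level[(leaf - 1) // k].children.append(leaf)
    return root, level

def simulate(k):
    """Run one inc per processor; return (returned values, message loads)."""
    n = k ** (k + 1)
    root, bottom = build_tree(k)
    load, values, val = Counter(), [], 0

    def send(src, dst, count=1):
        load[src] += count
        load[dst] += count

    def retire(node, pending):
        old, new = node.pid, node.pid + 1
        node.pid, node.age = new, 0
        send(old, new, k + 1)  # hand the job over to the new processor
        neighbours = node.children + ([node.parent] if node.parent else [])
        for nb in neighbours:
            if isinstance(nb, Node):
                send(old, nb.pid)
                nb.age += 1
                pending.append(nb)
            else:
                send(old, nb)  # a leaf processor learns its parent's new id

    for p in range(1, n + 1):
        node, sender, path = bottom[(p - 1) // k], p, []
        while node is not None:
            send(sender, node.pid)  # inc request travels one level up
            node.age += 2           # one for receiving, one for sending
            path.append(node)
            sender, node = node.pid, node.parent
        send(sender, p)             # the root answers with the counter value
        values.append(val)
        val += 1
        pending = path
        while pending:
            node = pending.pop()
            if node.age >= 4 * k:
                retire(node, pending)
    return values, load

values, load = simulate(3)
print(values == list(range(81)))  # True
print(max(load.values()))         # bottleneck load observed for k = 3
```

The returned values are 0, 1, ..., n-1 in the order of the requests; the largest entry of `load` is the bottleneck observed under this particular schedule.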

The message load
While correctness is straightforward and is therefore omitted, we will now derive a bound on
the message load in detail.
Retirement Lemma: No node retires more than once during any single inc operation.
Proof: Assume to the contrary that there is such a node, and let u be the first node (in historic
order) that retires a second time. Since u is first, all children and the parent of u retired only
once during the current inc operation. Therefore, u receives at most k+1 messages. Since
k+1 < 4k for k ≥ 1, node u cannot retire twice. □
Grow Old Lemma: If an inner node does not retire during an inc operation, it sends and
receives at most four messages.
Proof: Let p be the processor that initiates the inc operation. Each inner node u that is on the
path from leaf p to the root receives one message from its child v on that path, and it forwards
this message to its parent. Among all nodes adjacent to u, only its parent and v can retire during
the current inc operation, because u's other children are not on the path from p to the root
and belong to $I_p$ only if u retires. Due to the Retirement Lemma, no node can retire more than
once during a single inc operation; thus u does not receive more than two retirement messages.
To sum up, a node u receives one message if its parent retires, and not more than three further
messages if u is on the path from p to the root. □
Number of Retirements Lemma: During the entire sequence of n inc operations, each node
on level i retires at most $k^{k-i} - 1$ times.
Proof: The root lies on each path, and therefore receives at most two messages per inc operation
and sends one message (the counter value). It retires after every 4k messages, with the total
number $r_0$ of retirements satisfying
$$r_0 \leq \left\lfloor \frac{3n}{4k} \right\rfloor = \left\lfloor \frac{3}{4} k^k \right\rfloor \leq k^k - 1 .$$
In general, a node on level i is on $k^{k-i+1}$ paths, and it receives and sends at most
$3 k^{k-i+1} + r_{i-1}$ messages. With a retirement at every 4k messages, we inductively get a total
for the number $r_i$ of retirements of a node on level i:
$$r_i \leq \left\lfloor \frac{1}{4k} \left( 3 k^{k-i+1} + r_{i-1} \right) \right\rfloor \leq \left\lfloor \frac{1}{4k} \left( 4 k^{k-i+1} - 1 \right) \right\rfloor = k^{k-i} - 1 .$$
□
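The induction can be checked numerically. The sketch below assumes the constants as read from the proof, namely at most 3 messages per path, one further message per retirement of the parent, and a retirement after every 4k messages; it evaluates the resulting recurrence and compares it with the bound $k^{k-i} - 1$. The function name `retirement_bounds` is our own.

```python
def retirement_bounds(k):
    """Upper bounds r_0, ..., r_k from the recurrence (assumed constants 3 and 4k)."""
    n = k * k**k
    r = [3 * n // (4 * k)]
    for i in range(1, k + 1):
        r.append((3 * k ** (k - i + 1) + r[i - 1]) // (4 * k))
    return r

for k in range(1, 6):
    r = retirement_bounds(k)
    # Each level-i bound stays within the k**(k-i) - 1 available replacements.
    assert all(r[i] <= k ** (k - i) - 1 for i in range(k + 1))
print(retirement_bounds(3))  # [20, 8, 2, 0]
```

For k = 3, the nodes on level 3 (the parents of the leaves) never retire, matching the bound $k^0 - 1 = 0$.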
Let us now consider the availability of processors that replace others when nodes retire. The
initial ids at inner nodes on levels 1 through k have been defined just for the purpose of providing
a sufficiently large interval of replacement processor identifiers. The jth node ($j = 0, \ldots, k^i - 1$)
on level i ($i = 1, \ldots, k$) initially uses processor $(i-1) k^k + j k^{k-i} + 1$; its replacement
processor candidates are those with identifiers
$$(i-1) \, k^k + j \, k^{k-i} + \{2, \ldots, k^{k-i}\} .$$
Note that these are exactly $k^{k-i} - 1$ processors, just as needed in the worst case. In addition,
note that the root replaces its processor $k^k - 1$ times.
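That the identifier intervals never collide can be checked for small k. The sketch below assumes that node j (0-based) on level i owns the interval starting at its initial identifier $(i-1) k^k + j k^{k-i} + 1$ and spanning $k^{k-i}$ consecutive identifiers; the function name `candidate_ids` is our own.

```python
def candidate_ids(i, j, k):
    """Initial id followed by the replacement ids of node j on level i."""
    start = (i - 1) * k**k + j * k ** (k - i) + 1
    return range(start, start + k ** (k - i))

k = 3
used = [
    pid
    for i in range(1, k + 1)
    for j in range(k**i)
    for pid in candidate_ids(i, j, k)
]
print(sorted(used) == list(range(1, k * k**k + 1)))  # True
```

Under this assumption, the intervals of all inner nodes on levels 1 through k partition the identifiers 1, ..., n, so each processor works for at most one such node.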

Inner Node Work Lemma: Each processor receives and sends at most O(k) messages while
it works for a single inner node.
Proof: When a processor starts working for a node, it receives k+1 messages from its predecessor
that tell it the identifiers of its parent and its children. From the Grow Old Lemma, we
conclude that it receives and sends at most 4k + O(1) messages before it retires. Upon its
retirement, it sends k+1 messages to its successor and one to its parent and to each of its
children. □
Leaf Node Work Lemma: During the entire sequence of n inc operations, each leaf receives
and sends at most 2 messages.
Proof: Each leaf initiates exactly one inc operation and receives an answer, accounting for two
messages. It receives an extra message whenever its parent retires. Since the parent is on level
k, the Number of Retirements Lemma tells us that this happens
$$k^{k-k} - 1 = k^0 - 1 = 0$$
times. □
Bottleneck Theorem: During the entire sequence of n inc operations, each processor receives
and sends at most O(k) messages, where $k \cdot k^k = n$.
Proof: Each processor starts working at most once for the root and at most once for another
inner node. From the Number of Retirements Lemma and the Inner Node Work Lemma, we
conclude that the load for this part is at most O(k) messages. From the Leaf Node Work
Lemma, we get two additional messages, with a total of O(k) messages as claimed. □
Acknowledgements
We'd like to thank Masafumi (Mark) Yamashita and Thomas Roos for helpful discussions and
pointers to the literature.
References
AHS James Aspnes Maurice Herlihy and Nir Shavit Counting networks and multi
processor coordination In Proceedings of the Twenty Third Annual ACM Symposium
on Theory of Computing pages  New Orleans Louisiana  May 
DHW Cynthia Dwork Maurice Herlihy and Orli Waarts Contention in shared memory
algorithms In Proceedings of the TwentyFifth Annual ACM Symposium on Theory of
Computing pages  San Diego California  May 
EL Paul Erd"os and L#aszl#o Lov#asz Problems and results on chromatic hypergraphs and
some related questions In Innite and Finite Sets pages  
GB Hector GarciaMolina and Daniel Barbara How to assign votes in a distributed system
Journal of the ACM 
 October 
GVW James R Goodman Mary K Vernon and Philip J Woest Ecient synchronization
primitives for largescale cachecoherent multiprocessors In Third International Con
ference on Architectural Support for Programming Languages and Operating Systems
pages  Boston Massachusetts  April  ACM Press

HLS Maurice Herlihy BengHong Lim and Nir Shavit Low contention load balancing
on largescale multiprocessors In Proceedings of the th Annual ACM Symposium on
Parallel Algorithms and Architectures pages  San Diego California June 
July   SIGACT$SIGARCH
HMP Ron Holzman Yosi Marcus and David Peleg Load balancing in quorum systems
In Proceedings of the th International Workshop on Algorithms and Data Structures
WADS	 volume  of LNCS pages  Berlin GER August  Springer
HSW Maurice Herlihy Nir Shavit and Orli Waarts Linearizable counting networks
DISTCOMP
 Distributed Computing  
JM Sushil Jajodia and David Mutchler Dynamic voting algorithms for maintaining the
consistency of a replicated database ACM Transactions on Database Systems ACM
CR  
 June 
KM Akhil Kumar and Kavindra Malik Optimizing the costs of hierarchical quorum
consensus Acta Informatica 
 
Lov L#aszl#o Lov#asz Coverings and colorings of hypergraphs In Proc th Southwestern Conf
Combinatorics Graph Theory and Computing pages  
Mae Mamoru Maekawa A
p
N algorithm for mutual exclusion in decentralized systems
ACM Transactions on Computer Systems 
 May 
Nei Mitchell L Neilsen Quorum Structures in Distributed Systems PhD thesis Dept
Computing and Information Sciences Kansas State University 
PW David Peleg and Avishai Wool Crumbling walls A class of practical and ecient
quorum systems In Proceedings of the th Annual ACM Symposium on Principles of
Distributed Computing pages  ACM August 
PW David Peleg and Avishai Wool How to be an ecient snoop or the probe complexity
of quorum systems 
extended abstract In Proceedings of the th Annual ACM
Symposium on Principles of Distributed Computing pages  ACM May 
SUZ Nir Shavit Eli Upfal and Asaph Zemach A steady state analysis of di	racting
trees In Proceedings of the th Annual ACM Symposium on Parallel Algorithms and
Architectures pages  Padua Italy June   SIGACT$SIGARCH
SZ Nir Shavit and Asaph Zemach Di	racting trees In Proceedings of the th Annual
Symposium on Parallel Algorithms and Architectures pages  New York NY
USA June  ACM Press
YTL PenChung Yew NiauFeng Tzeng and Duncan H Lawrie Distributing hotspot
addressing in large scale multiprocessor In International Conference on Parallel
Processing pages  Los Alamitos Ca USA August  IEEE Computer Society
Press

