--Antoine de Saint-Exupéry, The Little Prince

Summary. The counting problem requires n asynchronous processes to assign themselves successive values. A solution is linearizable if the order of the values assigned reflects the real-time order in which they were requested. Linearizable counting lies at the heart of concurrent timestamp generation, as well as concurrent implementations of shared counters, FIFO buffers, and similar data structures. We consider solutions to the linearizable counting problem in a multiprocessor architecture in which processes communicate by applying read-modify-write operations to a shared memory. Linearizable counting algorithms can be judged by three criteria: the memory contention produced, whether processes are required to wait for one another, and how long it takes a process to choose a value (the latency). A solution is ideal if it has low contention, low latency, and it eschews waiting. The conventional software solution, where processes synchronize at a single variable, avoids waiting and has low latency, but has high contention. In this paper we give two new constructions based on counting networks, one with low latency and low contention, but that requires processes to wait for one another, and one with low contention and no waiting, but that has high latency. Finally, we prove that these trade-offs are inherent: any low-contention linearizable counting algorithm that does not rely on waiting must have high latency.

A preliminary version of this work appeared in the Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, San Juan, Puerto Rico, October 1991, pp. 526-535 [16].
* Research partly supported by ONR N00014-91-J-4052, ARPA Order 8225.
** Parts of this work were performed while at the MIT Lab. for Computer Science, supported by DARPA contracts N00014-89-J-1988 and N00014-87-K-0825 and by a Rothschild postdoctoral fellowship.
*** Research supported in part by an NSF Postdoctoral Fellowship. Part of this work was done while the author was at Stanford University and supported by NSF grant CCR-8814921, U.S. Army Research Office Grant DAAL-03-91-G-0102, ONR contract N00014-88-K-0166, and an IBM fellowship.
Introduction
In the counting problem, n asynchronous concurrent processes repeatedly assign themselves successive values, such as integers or locations in memory. The linearizable counting problem requires that the order of the values assigned reflect the real-time order in which they were requested [17, 24]. For example, if k values are requested, then values 0, …, k − 1 should be assigned, and if process P is assigned a value before process Q requests one, then P's value must be less than Q's. Linearizable counting lies at the heart of a number of basic problems, such as concurrent time-stamp generation, concurrent implementations of shared counters, FIFO buffers, and similar data structures (e.g. [8, 12, 22, 32]).

The requirement that the values chosen reflect the real-time order in which they were requested is called linearizability [17]. The use of linearizable data abstractions greatly simplifies both the specification and the proofs of multiple instruction/multiple data (MIMD) shared memory algorithms. As discussed in more detail elsewhere [17], the notion of linearizability generalizes and unifies a number of ad-hoc correctness conditions in the literature, and it is related to (but not identical with) correctness criteria such as sequential consistency [23] and strict serializability [28].
Linearizable counting algorithms can be judged by three criteria:

- Contention. Because of limitations on processor-to-memory bandwidth, performance suffers when too many processes attempt to access the same memory location at the same time. Such "hot-spot" contention is well-documented, and has been the subject of extensive research both in hardware [2, 11, 12, 20, 29] and in software [3, 9, 10, 27, 32].
- Latency. The time needed to choose a value is strongly affected by the number of variables a process must access. We will show that (not surprisingly) there is an inherent inverse relationship between the maximum contention at a variable and the number of variables accessed.

- Waiting. Algorithms that require later processes to wait for earlier ones are not robust: the failure or delay of a single process will halt or delay non-faulty processes. All else being equal, it is preferable to choose algorithms that ensure that some processes make progress even when others halt in arbitrary locations. Moreover, the effect of a sequence of processes each waiting for an action of the previous one is in some cases similar to the effect of a high-latency protocol, at least for the last processes in the sequence.
Informally speaking, a linearizable counting algorithm is ideal if it has low contention, low latency, and it eschews waiting. In this paper, we will show that no ideal linearizable counting algorithm exists, but that it is possible to satisfy any two out of the three criteria.
First, consider the naive solution in which all n processes increment a single shared variable using a read-modify-write operation, which atomically reads the value of a memory location, modifies it, writes it back, and returns the location's old value [12]. This algorithm has low latency (a single variable) and it eschews waiting (the read-modify-write is assumed to be atomic), but it has very high contention. (For more complete documentation of the performance problems of the single-variable solution see Anderson et al. [3] and Graunke and Thakkar [13].)
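For concreteness, here is a minimal sketch of this single-variable protocol (in Java; our own illustration, not code from the paper):

```java
import java.util.concurrent.atomic.AtomicInteger;

// The naive solution: one shared variable, one atomic read-modify-write
// per request. Latency is a single variable and no process ever waits,
// but all n processes contend on the same memory location.
final class NaiveCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    // getAndIncrement atomically reads, increments, and returns the old
    // value, so concurrent requests receive distinct successive values
    // in the real-time order in which the operations take effect.
    int takeValue() {
        return next.getAndIncrement();
    }
}
```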
Elsewhere [4], Aspnes, Herlihy, and Shavit have proposed low-contention solutions to the (non-linearizable) counting problem based on a new class of data structures called counting networks. In this paper, we show how counting networks can be adapted to solve linearizable counting. Each of our counting protocols consists of an arbitrary non-linearizable counting network coupled with a linearizing data structure called a filter. The combined construction has low contention provided that the counting network component has low contention. We first describe a constant-depth filter that requires processes to wait for one another, and then two filters that avoid waiting at the cost of higher depth. Finally, we prove that these trade-offs are a fundamental aspect of linearizable counting: any low-contention network that does not rely on waiting must have depth Ω(n), where n is the number of processes. Since non-linearizable counting does have ideal solutions [4] with low contention, polylogarithmic depth, and no waiting, this result establishes a substantial complexity gap between linearizable and non-linearizable counting.
Background
A counting network, like a sorting network [6], is a directed acyclic graph whose nodes are simple computing elements called balancers, and whose edges are called wires. Each token (input item) enters on one of the network's w ≤ n input wires, traverses a sequence of balancers, and leaves on an output wire. Unlike a sorting network, a w-input counting network can count any number N ≥ w of input tokens even if they arrive at arbitrary times, are distributed unevenly among the input wires, and propagate through the network asynchronously. Figure 2 shows a four-input four-output counting network. Intuitively, a balancer (see Fig. 1) is just a toggle mechanism that repeatedly alternates in sending tokens out on its output wires. Figure 2 shows an example computation in which input tokens traverse the network sequentially, one after the other. For notational convenience, tokens are labeled in arrival order, although these numbers are not used by the network. In this network, the first input (numbered 1) enters on wire 2 and leaves on wire 1, the second leaves on wire 2, and so on. (The reader is encouraged to try this for her/himself.) Thus, if on the i-th output wire the network assigns to consecutive output tokens the values i, i + 4, i + 2·4, …, it is counting the number of input tokens without ever passing them all through a shared computing element.
Counting networks are constructed to achieve a high level of throughput by decomposing interactions among processes into pieces that can be performed in parallel, effectively reducing memory contention.
In [4], Aspnes, Herlihy and Shavit introduced counting networks and presented two O(log² n) depth counting network designs. Aharonson and Attiya [1] and Busch and Mavronicolas [26] proved several fan-in/out tradeoffs and cyclicity properties of such networks. The effects of high balancer fan-out were studied in [21]. Klugerman and Plaxton [18] have shown an explicit network construction of depth O(c^{log* n} log n) for some small constant c, and an existential proof of a network of depth O(log n). This result was recently improved by Klugerman [19] to a constructive O(log n) network. Aiello, Venkatesan and Yung have shown randomized O(log n) constructions, and Shavit and Zemach have introduced highly efficient O(log n) depth networks called diffracting trees. Dwork, Herlihy, and Waarts [7] have recently devised a theoretical model for multiprocessor contention and used it to evaluate the properties of various counting networks.
Unfortunately, all known counting network constructions [1, 4, 5, 18, 19, 21, 26, 31] are not linearizable. It is even possible for a process to shepherd two tokens through a network, one after the other, and by suitable overtaking, have the second token receive the lesser value. Can counting networks solve linearizable counting?
Overview
In this paper, we show that there are no linearizable counting networks. Nevertheless, it is possible to use counting networks to construct a number of interesting counting algorithms. Each of these linearizable algorithms is based on a two-part data structure. First, each token traverses a (non-linearizable) counting network. Second, the result is used as an index into a filter data structure that enforces linearizability.
In Sect. 3, we introduce the WAITING network, which combines a standard counting network with a WAITING-FILTER data structure that forces later processes to wait for earlier processes. This combined construction yields a low-contention linearizable counting protocol that requires that processes wait for one another.
In Sect. 4, we present two linearizable counting protocols that do not require waiting. The SKEW network construction combines a standard counting network with a filter in which each token takes an average of O(n) steps, although an individual token may take an infinite number of steps if it is infinitely often overtaken. The REVERSE-SKEW network combines a counting network with a filter in which every token traverses no more than O(n²) balancers, hence starvation is impossible.
In Sect. 5, we prove that the tradeoffs among our constructions are inherent: in any low-contention linearizable counting network, a token must traverse an average of Ω(n) gates before taking a value. In [18, 19] it was shown that there exist width-n non-linearizable counting networks in which each token traverses at most O(log n) balancers. Our results therefore establish a substantial complexity gap between linearizable and non-linearizable data structures for counting. In other words, linearizability comes at a cost.
A brief introduction to counting networks
This section introduces counting networks. Our model for multiprocessor computation follows [17, 25]. The network definitions and examples are taken from [4], where a more complete discussion of the properties of counting networks can be found.

The following discussion assumes an interleaving model of computation [25], where there is no "global clock," and where the execution of an operation A is said to precede that of operation B in the real-time order if every atomic operation in the implementation of A precedes every atomic operation in the implementation of B [17, 24].
Counting networks belong to a larger class of networks called balancing networks, constructed from wires and computing elements called balancers.
A balancer. A balancer is a computing element with two input wires, denoted as the north and south wires (and indexed by 0 and 1), and two output wires, similarly named. Tokens arrive on the balancer's input wires at arbitrary times and are output on its output wires. Intuitively, one may think of a balancer as a toggle mechanism that, given a stream of input tokens, repeatedly sends one token to the north output wire and one to the south, effectively balancing the number of tokens that have been output on its output wires. We denote by x_i, i ∈ {0, 1}, the number of input tokens ever received on the balancer's i-th input wire, and similarly by y_i, i ∈ {0, 1}, the number of tokens ever sent on its i-th output wire. Throughout the paper we will abuse this notation and use x_i (y_i) both as the name of the i-th input (output) wire and as a count of the number of tokens received on the wire.
Let the state of a balancer at a given point in the computation be defined as the collection of tokens on its input and output wires. For the sake of clarity we will assume that tokens are all distinct. We denote by the pair (t, b) the state transition in which the token t passes from an input wire to an output wire of the balancer b.
We can now formally state the safety and liveness properties of a balancer:
1. In any state, x_0 + x_1 ≥ y_0 + y_1 (i.e., a balancer never creates output tokens).
2. Given any finite number of input tokens m = x_0 + x_1 to the balancer, it is guaranteed that within a finite number of transitions, it will reach a quiescent state, that is, one in which the sets of input and output tokens are the same. In any quiescent state, x_0 + x_1 = y_0 + y_1 = m.

It is important to note that we make no assumptions about the "timing" of token transitions from balancer to balancer in the network: the network's behavior is completely asynchronous. Although balancer transitions can occur concurrently, it is convenient to model them using an interleaving semantics in the style of Lynch and Tuttle [25]. An execution is sequential if for every two transitions e_i = (t_i, b_i) and e_j = (t_j, b_j), where t_i and t_j are the same token, all transitions between them also involve that token. In other words, in a sequential execution tokens traverse the network one completely after the other.
In a MIMD shared memory multiprocessor, a balancing network is implemented as a data structure, where balancers are records and wires are pointers from one record to another. Each of the machine's n asynchronous processes runs a program that repeatedly traverses the data structure, each time shepherding a new token through the network (see the following Sect. 2.1). The limitation on the number of concurrent processes translates into a limitation on the number of tokens concurrently traversing the network: at any time, at most n tokens traverse the network concurrently.
We define the depth of a balancing network to be the maximal depth of any wire, where the depth of a wire is defined as 0 for a network input wire, and as max_{i ∈ {0,1}}(depth(x_i)) + 1 for the output wires of a balancer having input wires x_i, i ∈ {0, 1}.
A counting network. A counting network of width w is a balancing network whose outputs y_0, …, y_{w−1} have the step property in quiescent states: in any quiescent state, 0 ≤ y_i − y_j ≤ 1 for any i < j. (Note that the width and depth of the network need not depend on the number of concurrent processes.)
To illustrate this property, consider an execution in which tokens traverse the network sequentially, one completely after another. Figure 2 shows such an execution on the BITONIC[4] network defined in [4]. As can be seen, the network moves input tokens to output wires in increasing order modulo w. A balancing network having this property is called a counting network, because it can easily be adapted to count the number of tokens that have entered the network. Counting is done by adding a "local counter" to each output wire i, so that tokens coming out of that wire are consecutively assigned the numbers i, i + w, i + 2w, …, i + (y_i − 1)w. The number i + w·k assigned by the counter at the end of output wire i to the k-th token exiting on it is called the token's value. We can now state the following simple yet useful lemma:
Implementing a counting network
In this paper, we assume that counting networks are implemented on a multiprocessor in which processes communicate by applying read-modify-write operations to a shared memory. The counting network is implemented as a data structure in memory. A balancer is represented as a record with the following fields: toggle is a boolean value (initially True), and north and south are pointers which reference either other balancers or counter cells. Processes shepherd tokens through the network by executing the code shown in Fig. 3. Each process toggles the balancer's state by calling fetch&complement, which atomically complements the toggle field and returns the old value. Based on the toggle state, it goes to the north or south successor. When it encounters a counter, it atomically increments it by w and returns the old value.
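The following sketch shows one way this traversal might look (in Java; the record layout follows the description above, but the code is our reconstruction, not the paper's Fig. 3):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// A balancer record: a toggle bit plus north/south successor pointers,
// each referencing another balancer or a counter cell.
final class Balancer {
    final AtomicBoolean toggle = new AtomicBoolean(true); // initially True
    Object north, south;                                  // Balancer or Counter
}

// A counter cell at the end of output wire i of a width-w network:
// it hands out the values i, i + w, i + 2w, ...
final class Counter {
    final AtomicInteger value;
    final int width;
    Counter(int wire, int width) {
        this.value = new AtomicInteger(wire);
        this.width = width;
    }
}

final class Traversal {
    // fetch&complement: atomically complements the toggle and returns the
    // old value (built here from a compare-and-set loop).
    private static boolean fetchAndComplement(AtomicBoolean toggle) {
        boolean old;
        do {
            old = toggle.get();
        } while (!toggle.compareAndSet(old, !old));
        return old;
    }

    // Shepherd one token from an entry balancer to a value. (We assume
    // a toggle value of True selects the north successor.)
    static int traverse(Balancer entry) {
        Object node = entry;
        while (node instanceof Balancer) {
            Balancer b = (Balancer) node;
            node = fetchAndComplement(b.toggle) ? b.north : b.south;
        }
        Counter c = (Counter) node;
        return c.value.getAndAdd(c.width); // atomically increment by w
    }
}
```

A process repeatedly calls traverse, entering on its chosen input wire, once per token it shepherds.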
The waiting network
The WAITING network is a data structure with low contention and low latency, but it requires processes to wait for one another. As mentioned above, this data structure has two components: tokens first traverse a (non-linearizable) counting network component, and then they traverse a linearizing data structure called a WAITING-FILTER. The key idea behind this filter is simple: each token exiting the network waits for a token to take the next lower value. This solution is therefore not robust, since a failure or delay by one process will force other, non-faulty processes to halt or delay. Nevertheless, on a cache-coherent bus-based multiprocessor, the WAITING network was observed to have contention and latency not much higher than that of its counting network component alone [16], probably because the serializing effect of the bus masks the serializing effects of the filter. On a distributed memory architecture, however, the WAITING network had substantially lower throughput than its counting network component alone [15].
The WAITING-FILTER is similar to a barrier. After traversing the counting network, the WAITING-FILTER forces tokens with lower values to "catch up." A token leaves the filter only when all lower values have been assigned, guaranteeing that every token that enters the network later will receive a higher value. More precisely, a WAITING-FILTER is an n-element array of boolean values, called phase bits, where indexing starts from 0. Define the function phase(v) to be ⌊v/n⌋ mod 2. We construct the new network by having tokens first traverse the counting network and then access the WAITING-FILTER. When a token exits the non-linearizable counting network with value v, it awaits its predecessor by going to location (v − 1) mod n in the array, and waiting for that location to be set to phase(v − 1). When this event occurs, it notifies its successor by setting location v mod n to phase(v), and then it returns its value.

We now argue that a token with value v indeed waits for the token with value v − 1, and not for some earlier token that used the same array location.

Proof. Assume by way of contradiction that p is the token of lowest value v to violate this property. It must have seen location (v − 1) mod n in the array set to phase(v − 1), a value that could only have been written by the token with value v − 2kn − 1, for some k > 0. In particular, a token with value v − n − 1 could not yet have written its phase bit, and thus by assumption, neither could any token with one of the n values v − n, …, v − 1. By the step property of the non-linearizable counting network component, since a token with value v exited the network, there must be at least n + 1 tokens currently traversing the network, or past the network and before the phase change, that will take on the values v − n − 1, v − n, …, v − 1. Together with p, more than n tokens would then be traversing the network concurrently, a contradiction. □
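A sketch of the WAITING-FILTER in the same spirit (Java; the initialization convention for the very first tokens is our assumption, chosen so that the token with value 0 does not wait):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// WAITING-FILTER: n phase bits. phase(v) = floor(v/n) mod 2. A token
// holding value v spins until location (v-1) mod n shows phase(v-1),
// then publishes phase(v) at location v mod n and returns v.
final class WaitingFilter {
    private final int n;
    private final AtomicIntegerArray phaseBits;

    WaitingFilter(int n) {
        this.n = n;
        this.phaseBits = new AtomicIntegerArray(n);
        // Assumption: initialize as if values -n .. -1 were already
        // assigned, so phase(-1) = 1 is already visible and the token
        // with value 0 proceeds immediately.
        for (int i = 0; i < n; i++) phaseBits.set(i, 1);
    }

    private int phase(long v) {
        // floorDiv/floorMod handle the negative argument for v = 0.
        return (int) Math.floorMod(Math.floorDiv(v, (long) n), 2L);
    }

    int filter(int v) {
        // Wait for the token with value v-1 to announce itself.
        while (phaseBits.get(Math.floorMod(v - 1, n)) != phase(v - 1)) {
            Thread.onSpinWait();
        }
        phaseBits.set(Math.floorMod(v, n), phase(v)); // notify successor
        return v;
    }
}
```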
Linearizable counting without waiting
In this section, we present two linearizable, low-contention counting protocols that do not require processes to wait for one another. Just as in the WAITING network of the previous section, each token traverses a non-linearizable counting network followed by a "filter" data structure. The resulting combined network has low contention provided that the initial counting network has low contention. The first protocol is non-blocking: it guarantees that some token always emerges after the system as a whole has taken a bounded number of steps, but it allows individual tokens to run forever without taking a value (starvation). The second construction is wait-free: it guarantees that every token emerges after taking a fixed number of steps (no starvation). Both networks have high latency, with depth Ω(n).
The Skew network
The SKEW-FILTER is an infinite balancing network illustrated in the left-hand side of Fig. 4 (for now, ignore the empty balancers and the numeric labels). A SKEW-LAYER network is an unbounded-size balancing network consisting of a sequence of balancers b_i, for 0 ≤ i. For b_0, both input wires are network input wires. For all b_i, the north output wire is a network output wire, and the south output wire is the north input wire of b_{i+1}. A SKEW-FILTER with layer depth d is constructed by layering d SKEW-LAYER networks so that the i-th output wire of one is the i-th input wire to the next.
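As an illustration of this wiring (Java, reusing the Balancer record sketched in Sect. 2.1; the construction helper is ours), a finite prefix of one SKEW-LAYER can be linked as follows:

```java
// Build the first m balancers of a SKEW-LAYER: each balancer's south
// output feeds the north input of the next balancer, and each north
// output is a layer output wire (left unset here; in a full filter it
// would point into the next layer or at a counter cell).
final class SkewLayer {
    static Balancer[] prefix(int m) {
        Balancer[] b = new Balancer[m];
        for (int i = 0; i < m; i++) b[i] = new Balancer();
        for (int i = 0; i + 1 < m; i++) {
            b[i].south = b[i + 1]; // south output wire -> b[i+1]
        }
        return b;
    }
}
```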
This filter is combined with a non-linearizable counting network as follows. Each token first traverses the non-linearizable counting network, and then uses the resulting value as the index of its input wire into the infinite SKEW-FILTER. The correctness of our constructions is based on the following technical lemma, easily proved by induction on the number of balancers in a balancing network.

Lemma 4.1. In any execution of a balancing network in which no more than c tokens enter on any input wire, no more than c tokens leave on any output wire.
Corollary 4.2. In any execution where no more than c tokens enter on any input wire, there are never more than c tokens on any wire.
The capacity c of an execution in which n tokens concurrently traverse a network is defined to be the maximal number of tokens that arrive on any input wire. Let the capacity c of a network be the maximum capacity over all executions. Corollary 4.2 implies that in a network with capacity c, no more than c tokens arrive on any internal or output wire during an execution involving n concurrent tokens.
In the SKEW-FILTER, when coupled with a counting network, the capacity c is 1, and thus at most one token enters or exits on each of a balancer's input/output wires. We can thus define the toggle state of a balancer to be the number of tokens it has output. Let a northwest barrier starting in balancer b_k be a sequence of balancers b_k, …, b_0, all in toggle state 2, where the north input wire of every b_i is the south output wire of b_{i−1}, and where b_0's north input is wire 0. (In other words, the northwest barrier is simply a partial network in some skew layer starting at balancer b_k and ending in the first balancer in this layer.) It immediately follows from Corollary 4.2 that any token that approaches a balancer in a northwest barrier will be diverted below the barrier, effectively protecting all wires behind the barrier from late-arriving tokens.
Lemma 4.4. Let q be a token that enters the SKEW-FILTER after token p has taken a value. If q traverses a higher-numbered wire than p at layer k, then it does so at all layers greater than k.
Proof. Assume otherwise. Then p's path and q's must cross. The only way two paths can cross in the SKEW-FILTER is if they traverse a common balancer. By Corollary 4.2, each balancer is visited by at most two tokens, and since p got there first (i.e., in toggle state 0), p must exit on the north wire, and q on the south. □

Corollary 4.5. Let q be a token that enters the SKEW-FILTER after token p has taken a value. If p and q pass through a common balancer, then q will take a higher value than p.
Theorem 4.8. Let a and b be tokens such that a takes its value before b enters the SKEW network, that is, t_exit(a) < t_enter(b). Then value(a) < value(b).
Proof. We argue inductively that this property is preserved among all tokens that have entered the SKEW-FILTER on wires less than or equal to k. When k = 0, the result is immediate, so assume the result holds for wires less than k, for k > 0.

We prove the result for wires less than or equal to k by way of contradiction. Assume that token p exits the SKEW network, and token q then enters the SKEW network and exits with a value less than p's. Lemma 4.4 implies that q entered the filter on a lower-numbered wire than p. The inductive hypothesis therefore implies that p enters the filter on wire k. There are two cases to consider: (1) p leaves some balancer b on its south wire, and (2) p leaves every balancer on its north wire.
In the first case, Lemma 4.3 implies that there is a northwest barrier extending from b to wire 0, and the token q must be diverted south (below the barrier) to higher numbered lines. Lemma 4.4 implies therefore that q will take a value greater than p's, a contradiction.
In the second case, if k ≤ n − 1 = d, then p goes north until it reaches wire 0, and the result is immediate. Otherwise, if k > n − 1, then p goes north on n − 1 balancers, and hence gets value k − n + 1. Since k > n − 1, Lemma 2.1 applied to the non-linearizable counting network implies that at least k − n + 1 tokens must have entered the SKEW-FILTER on lines less than k and left it before p entered it. Therefore, since by Lemma 4.1 only one token can exit on a given output wire of the filter, there exists a token r that exited the network before p entered the filter, and took a value of at least k − n. It follows that r exits the network before q entered it, and by the induction hypothesis, it took a lesser value than q, since otherwise we would have a linearizability violation among the first k − 1 lines. But in this case, q's value must be smaller than p's value k − n + 1 and greater than r's value, which is at least k − n, a contradiction. □
The Reverse-skew network
Our second construction is the REVERSE-SKEW network. A REVERSE-SKEW-LAYER is the mirror image of the SKEW-LAYER. It consists of a sequence of balancers b_i, for 0 ≤ i. For b_0, both output wires are network output wires. For all b_i, i > 0, the south output wire is a network output wire, and the north output wire is the south input wire of b_{i−1}. A REVERSE-SKEW-FILTER of layer depth d is constructed by layering d REVERSE-SKEW-LAYER networks so that the i-th output wire of one is the i-th input wire to the next. The protocol is the same as before: each token traverses the non-linearizable counting network, and uses its output value to choose its input wire into the REVERSE-SKEW-FILTER. The proof that the REVERSE-SKEW network is linearizable is omitted because it is nearly identical to that of Theorem 4.8. It uses one additional observation: Lemma 2.1 implies that there is no violation of linearizability between any two tokens that enter the filter on input wires that are of distance greater than ⌈(n − 1)/2⌉w − 1. Therefore, the northwest barrier created when some token exits the network need only protect against tokens that entered on input wires that are less than ⌈(n − 1)/2⌉w apart from its filter input wire.
The following lemma shows that the REVERSE-SKEW network is wait-free.
Lemma 4.11. The number of balancers traversed by any token in the REVERSE-SKEW-FILTER with layer depth ⌈(n − 1)/2⌉w is at most n + 2⌈(n − 1)/2⌉w − 3.
Proof. Note that a token can exit on the south end of at most ⌈(n − 1)/2⌉w − 1 balancers. The number of the output wire on which a token exits is at most n − 1 smaller than the number of the token's input wire in the filter, and therefore a token can exit on the north end of at most n − 1 + ⌈(n − 1)/2⌉w − 1 balancers, and the claim follows. □

As in Lemma 4.9, the average number of balancers traversed by any token in the REVERSE-SKEW-FILTER is 2⌈(n − 1)/2⌉w − 2. To optimize the contention of the non-linearizable counting network, one may want to take w = n; in this case, the layer depth of the REVERSE-SKEW network is O(n²).
Implementing an infinite network
We now show how to represent the infinite SKEW-FILTER using a finite data structure. (The construction for the REVERSE-SKEW-FILTER is omitted, since it is nearly identical.) We first define a coordinate system for identifying balancers. Each balancer is denoted b_{i,j}, where i ranges from 0 to infinity and j ranges from 0 to d − 1 in a network of layer depth d. Balancer b_{i,0} is the first balancer whose north output wire is on row i, b_{i,d−1} is the last balancer on row i, and b_{i,j} is the balancer on layer j and row i.
A folded SKEW-FILTER is a width-w by depth-d array of multi-balancers c_{i,j}. The multi-balancer c_{0,0} has two input wires, each c_{i,0}, i > 0, has one input wire, and each c_{i,d−1} has one output wire. For 0 ≤ i < w and 0 ≤ j < d, there is one wire from c_{i,j} to c_{i+1,j}, where index arithmetic is mod w; and for 0 ≤ i < w and 0 ≤ j < d − 1, there is also one wire from c_{i,j} to c_{i,j+1}. The multi-balancer c_{i,j} simulates each of the balancers b_{i−j,j}, b_{i−j+w,j}, b_{i−j+2w,j}, …. The folding of a SKEW-FILTER of layer depth d = 4 into a folded network with w = 4 and d = 4 is illustrated in Fig. 4.
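In code (Java; a small helper of our own devising), the coordinate mapping just described sends the balancer on row r and layer j to multi-balancer c_{(r+j) mod w, j}, as its k-th simulated balancer with k = ⌊(r + j)/w⌋:

```java
final class Folding {
    // Map the balancer on (row, layer) of the infinite SKEW-FILTER to its
    // folded representative: the column i of the multi-balancer array and
    // the index k of the simulated balancer within that multi-balancer,
    // so that b(row, layer) is the k-th balancer simulated by c(i, layer).
    static int[] fold(int row, int layer, int w) {
        int i = Math.floorMod(row + layer, w);   // multi-balancer column
        int k = Math.floorDiv(row + layer, w);   // simulated-balancer index
        return new int[] { i, k };
    }
}
```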
Like a balancer, a multi-balancer can also be represented as a record with toggle, north, and south fields. The north and south fields are still pointers to the neighboring multi-balancers or counters, but the toggle component is more complex, since it encodes the toggle states of an infinite number of balancers. The following theorem shows that this infinite sequence has a simple structure: the toggle component of the multi-balancer c_{i,j} can be treated as a set containing (at most) 2n + 2 pairs (k, s_k), recording the indices k at which the toggle state of b_{i−j+kw,j} differs from that of b_{i−j+(k−1)w,j}, together with an additional pair (0, s_0). This set could be implemented with a short critical section (which introduces a small likelihood of blocking), or it could be implemented without blocking using read-modify-write operations as discussed elsewhere [14].
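One possible realization of such a toggle component (Java; the run-length representation and the locking discipline are our assumptions, following the short-critical-section option mentioned in the text):

```java
import java.util.TreeMap;

// The toggle component of one multi-balancer. It encodes the toggle bits
// of the infinite family of simulated balancers as a map from an index k
// (k >= 0) to the bit value that holds from k until the next recorded
// index; by the structure theorem only O(n) change points are needed.
final class MultiBalancerToggle {
    private final TreeMap<Long, Boolean> runs = new TreeMap<>();

    MultiBalancerToggle() {
        runs.put(0L, Boolean.TRUE); // every simulated toggle starts True
    }

    // Atomically fetch&complement the toggle bit of the k-th simulated
    // balancer. A short synchronized section stands in for the
    // non-blocking read-modify-write implementation cited in the text.
    synchronized boolean fetchAndComplement(long k) {
        boolean old = runs.floorEntry(k).getValue();
        boolean afterK = runs.floorEntry(k + 1).getValue(); // value at k+1, pre-update
        runs.put(k, !old);               // bit k flips...
        runs.putIfAbsent(k + 1, afterK); // ...but bits above k keep their value
        // (A fuller implementation would merge adjacent runs with equal
        // values to keep the map within its 2n+2-pair bound.)
        return old;
    }
}
```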
Lower bounds
We now show that it is impossible to construct an ideal linearizable counting algorithm, one with low contention, low latency, and no waiting. We give two results. The first concerns counting networks: any non-trivial non-waiting linearizable counting network must have an infinite number of balancers (the trivial counting network consists of a single balancer), implying that the "folding" structure employed in the previous section's filter constructions is, in a sense, inescapable. The second concerns linearizable counting in general: in any non-waiting protocol, whether based on counting networks or not, contention and latency are inversely related.
The lower bound on the number of balancers is not as alarming as it sounds, since we have shown it is possible to "fold" an infinite number of balancers into a simple finite data structure. The time bound is more significant: in a low-contention non-waiting network, any process must traverse an average of Ω(n) balancers before choosing a value. There exist non-linearizable counting networks with polylogarithmic depth [1, 4, 18], and therefore non-waiting linearizable counting networks will always have higher latency than their non-waiting non-linearizable counterparts.
Lower bounds on size
We first show that the only non-blocking linearizable counting network of finite width is the trivial one consisting of a single balancer. Given a nontrivial finite counting network, we construct an execution in which a later token overtakes an earlier token, resulting in non-linearizable behavior.
Theorem 5.1. There is no non-blocking finite-width linearizable counting network of width greater than two.
Proof. We assume such a network of width w and derive a contradiction. Let b be the last balancer on wire w − 1. Send w tokens p_0, …, p_{w−1} sequentially through the network, where each p_i enters on input wire i. If a token arrives at balancer b, halt it on b's input wire; otherwise let it proceed until it takes a value. Lemma 4.1 implies that there is exactly one token on each input wire of b.
One of the halted tokens on b's input wires is p_{w−1}. To see why, consider the state of the network before p_{w−1} enters. At least one token is halted before b. If all halted tokens resume their traversals, then the step property implies that exactly one token will have emerged on each of the wires 0, …, w − 2, and none on w − 1. Thus p_{w−1} must exit on wire w − 1 and therefore is halted on one of b's input wires. Now let p_{w−1} resume its traversal, taking a value less than w − 1 (since there is at least one more halted token on the input wires to b), and send w more tokens q_0, …, q_{w−1} sequentially through the network, where each q_i enters on input wire i. As before, if a token arrives at balancer b, halt it on b's input wire; otherwise let it proceed until it takes a value. Each q_i follows the same path as p_i, and by similar reasoning, two q_i are halted before b, one being q_{w−1}. The remaining w − 2 > 0 tokens q_i will each take values greater than w − 1. If q_{w−1} resumes its traversal, it will be the second token to visit b, hence it will take w − 1, violating linearizability. □

Note that we have actually proved a slightly stronger result. In the execution we constructed, no token overtakes another on a single wire, and therefore there is no non-trivial finite linearizable counting network even under the additional constraint that the wires between balancers are first-in-first-out. The theorem applies not only to strict counting networks but also to filter networks: the limitations implied by the theorem apply to combined constructions in which each token traverses a non-linearizable counting network and uses the result as an index into a linearizing filter network.

Theorem 5.2. There is no non-trivial non-blocking linearizable counting network.

Proof. Suppose otherwise. Theorem 5.1 implies that the network has infinite width. The step property requires that each output wire of an infinite-width network be traversed no more than once in any finite execution. Consider a sequential execution in which token p enters on input wire i, runs uninterruptedly through the network, and emerges after d steps on output wire j. If we run 2^d additional tokens sequentially from input wire i, then the last token will follow exactly the same path as p, since the state of each balancer along the path will have been reset. Now two tokens have traversed output wire j, violating the step property. □
Lower bounds on time
In this section, we prove some fundamental lower bounds for any linearizable counting protocol that does not use waiting, whether or not it relies on counting networks. A protocol is defined as follows: each process applies read-modify-write operations to a sequence of variables and then chooses a value. A process may choose the next variable based on the values of earlier variables, but some process must decide after a finite number of steps (no waiting). The protocol's latency is the maximum number of variables any process visits before choosing its value. A protocol is quiescent if no process is in the middle of choosing a value. In the protocols given so far, the variables correspond to balancers, and the latency corresponds to the network depth.
A path is a sequence of variables. In any protocol state, process p has preferred path u if p would traverse u if it were run in isolation until choosing a value. If p would choose value v, then v is its preferred value. Define the capacity c of the protocol to be the maximal number of processes that access any particular variable in any execution. If c is high, so is the potential maximum number of concurrent accesses to a variable, so capacity is a measure of potential contention.
Consider a linearizable counting protocol for n processes with capacity c.

Lemma 5.3. In any quiescent state, the preferred path of any process p includes at least (n − 1)/(c − 1) variables.

Proof. Consider the following execution. Suppose the protocol is in a quiescent state, and i − 1 is the last value taken. For each process q distinct from p, run q in isolation until either:
1. q is about to choose value i.
2. q is about to access a variable in p's preferred path.
We claim the first case cannot occur. Since the protocol is in a quiescent state, all values less than i have been taken, and therefore any process that starts the protocol and runs uninterruptedly must choose i. If p and q can both run to completion without accessing a common variable, they will both choose i, a contradiction. Therefore q's path must eventually intersect p's preferred path.
By hypothesis, no more than c − 1 processes can access any variable along p's path. Since every process's path must intersect p's path somewhere, the path must include at least (n − 1)/(c − 1) variables. □

Theorem 5.4. In any non-waiting linearizable counting protocol for n processes with capacity c, a process must access an average of Ω(n/c) variables before choosing a value.

This theorem has further implications for counting networks. Elsewhere [4], we have shown that the set of balancers traversed by a set of tokens in a counting network does not depend on how transitions are interleaved, which implies:

Corollary 5.5. In any execution of a counting network, the average number of balancers traversed by every token is Ω(n/c).
Modeling contention
In this paper we approximate contention by capacity. Low capacity clearly implies low contention, but not vice versa. Subsequent to our work, Dwork, Herlihy and Waarts provided a more detailed complexity model for contention in multiprocessors [7]. Our notion of capacity is closely related to their notion of variable-contention, defined as the worst-case number of concurrent accesses to any single variable occurring during an execution of the algorithm. Variable-contention can also be viewed as the contribution of a single variable to the overall contention of the algorithm. The model of [7] serializes simultaneous accesses to a single memory location: only one operation succeeds at a time, and other pending operations must stall. The contention of a concurrent object with concurrency n is defined as the worst case, over all executions of at most n concurrent processes, of the ratio of the delays occurring over multiple (possibly concurrent) accesses to the object divided by the number of accesses to the object.
Since we model executions by sequences of read-modify-write operations, c concurrent accesses in their model correspond in our model to a sequence of c successive read-modify-write operations performed by c distinct processes on the same variable. With this in mind, the proof of Lemma 5.3 holds as stated for variable-contention c. Consequently Lemma 5.3, Theorem 5.4, and Corollary 5.5 also hold when c is the variable-contention. It follows that in any non-waiting protocol, whether based on a counting network or not, variable-contention and latency are inversely related. For more details the reader is referred to [7].
Conclusion
The following joke circulated in Italy during the 1920's.
Mussolini claims that the ideal citizen is intelligent, honest, and Fascist. Unfortunately, no one is perfect, which explains why everyone is either intelligent and Fascist but not honest; honest and Fascist but not intelligent; or honest and intelligent but not Fascist.
The ideal linearizable counting algorithm has low contention, low latency, and does not require waiting. Unfortunately, Theorem 5.4 shows that no ideal algorithms exist. The best algorithms one can devise either have low latency and no waiting but high contention (like the single shared variable), low contention and low latency but require waiting (like the WAITING-FILTER), or low contention and no waiting but high latency (like the SKEW-FILTER and REVERSE-SKEW-FILTER constructions).