Counting networks are concurrent data structures that serve as building blocks in the design of highly scalable concurrent data structures in a way that eliminates sequential bottlenecks and contention. Linearizable counting networks assure that the order of the values returned by the network reflects the real-time order in which they were requested. Linearizability is an important consistency condition for concurrent data structures, as it simplifies proofs and enhances compositionality.
INTRODUCTION
Counting networks [4] are a class of highly scalable structures used for concurrent counting. Such networks allow the design of concurrent data structures in a way that eliminates sequential bottlenecks and contention. Unlike queue-locks [21] and combining trees [13], which are based on a single counter location handing out indices, counting networks hand out indices from a collection of counter locations. To guarantee that indices handed out by the separate counters are not erroneously "duplicated" or "omitted," one adds a special network coordination structure to be traversed by processes before accessing the counters.
Counting networks [4] are constructed from simple computing elements called balancers (see Figure 1). Tokens arrive on the balancer's input wires and are output on its output wires. Intuitively one may think of a balancer as a toggle mechanism that, given a stream of input tokens, repeatedly sends one token to the left output wire and one to the right, effectively balancing the number of tokens that have been output. In order to form a counting network, balancers are connected to one another by wires in an acyclic fashion, in the same way comparators are connected to form a sorting network [11]. However, unlike sorting networks, counting networks are asynchronous in nature, that is, tokens arrive at the network's input wires at arbitrary times, and traverse the network at differing paces. Nevertheless, if the balancers are connected correctly, a network having $w$ consecutively numbered output wires will move input tokens to output wires in increasing order modulo $w$.
Networks of balancers having this property can easily be adapted to count the total number of tokens that pass through them. Counting is done by adding a "local counter" to each output wire $i$, so that tokens coming out of that wire are assigned numbers $i$, $i + w$, $i + 2w$, and so on.
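The toggle behavior of a balancer and the local counters on the output wires can be illustrated with a minimal sequential sketch. It assumes the smallest possible network, a single balancer of width $w = 2$; the class and method names are hypothetical and chosen only for illustration.

```python
class Balancer:
    """Toggle that sends successive tokens to output 0, 1, 0, 1, ..."""

    def __init__(self):
        self.toggle = 0

    def traverse(self):
        out = self.toggle
        self.toggle = 1 - self.toggle
        return out


class WidthTwoCounter:
    """One balancer followed by a local counter on each output wire."""

    def __init__(self):
        self.width = 2
        self.balancer = Balancer()
        # The counter on wire i hands out i, i + w, i + 2w, ...
        self.next_value = [0, 1]

    def get_and_increment(self):
        wire = self.balancer.traverse()
        value = self.next_value[wire]
        self.next_value[wire] += self.width
        return value


counter = WidthTwoCounter()
values = [counter.get_and_increment() for _ in range(6)]
print(values)
```

When tokens traverse the network one at a time, as here, the values come out in consecutive order; the interesting behavior discussed in the remainder of the paper arises only when traversals overlap in time.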
On a shared memory multiprocessor, counting networks are implemented as data structures in which balancers are represented as records and wires as pointers among them. Tokens are "shepherded" by processors that traverse this pointer-based data structure from input pointers to output wires, finally incrementing the counter on the appropriate output wire. This implies that tokens may overtake one another on a wire and that balancer and network traversal times are dependent on individual processor speeds and variations in speeds.
A Bitonic counting network [4] has a layout isomorphic to Batcher's Bitonic sorting network [7]. Bitonic counting networks for $n$ processors have width $w < n$ and depth $\Theta(\log^2 w)$ (all logarithms in this paper are to the base 2). Unlike combining trees, counting networks support complete independence among requests and are thus highly fault tolerant. At peak performance their throughput is $w$, as $w$ indices are returned per time step by the independent counters. Unfortunately, counting networks suffer a performance drop-off due to contention as concurrency increases, and the latency in traversing them is a high $\Theta(\log^2 w)$. There is a wide body of research on counting networks [2, 3, 4, 9, 10, 12, 15, 17, 18]. A recently developed form of counting network called a Diffracting Tree [24] is based on a new type of distributed balancer implementation. It has been shown to scale especially well, exhibiting low latency since its depth is logarithmic in $w$.
Linearizability is a consistency condition for concurrent systems formulated by Herlihy and Wing [16]. It requires that the values returned by access requests to a concurrent shared object reflect the order in which they were issued. The use of linearizable data abstractions simplifies both the specification and the proofs of multiple instruction/multiple data shared memory algorithms. As Herlihy and Wing explain, linearizability generalizes and unifies a number of ad-hoc correctness conditions in the literature, and is related to (but not identical with) correctness criteria such as sequential consistency [19] and strict serializability [22].
Herlihy, Shavit, and Waarts [15] defined the class of linearizable counting networks, networks that assure that the order of the values returned by the network reflects the real-time order in which they were requested. Linearizable counting lies at the heart of concurrent timestamp generation, as well as concurrent implementations of shared counters, FIFO buffers, priority queues and similar data structures. Unfortunately, for both the Bitonic networks of Aspnes, Herlihy, and Shavit [4] and the Diffracting Trees of Shavit and Zemach [24], there exist worst case asynchronous schedules in which linearizability is violated. In [15] linear depth linearizable counting network constructions were presented and shown to be optimal, that is, any low contention counting network that is linearizable in all executions must have linear depth.
Timing and Linearizability
This paper provides a characterization of the timing conditions under which low depth non-linearizable counting networks become linearizable. It applies to semisynchronous and real-time systems [6] where upper and lower time bounds that limit the extent to which one process can be slower or faster than others are known. As we show, our characterization also extends beyond such systems and has implications in the analysis of counting network linearizability in general asynchronous multiprocessor systems. We believe that the linear time cost of designing counting networks achieving linearizability under all circumstances may be an unnecessary burden on applications that are willing to trade off occasional non-linearizability for speed and parallelism. In such systems an intelligent trade-off decision can be made with the help of a clear characterization of the parameters governing linearizability.
Our main result is a simple timing condition that is local to the individual wires and balancers of the network. It quantifies the extent to which a network can suffer from timing anomalies and still remain linearizable.
This result is interesting, since even a counting network of depth one exhibits non-linearizable behavior. Consider the following scenario for a counting network consisting of the balancer $B$ and two atomic counters $A_0$ and $A_1$ that have initial values 0 and 1 and count by 2: Token $T_0$ enters the balancer via $x_0$, exits via $y_0$, and then is delayed. Token $T_1$ enters via $x_0$, exits via $y_1$, and obtains the value 1 from the counter $A_1$. Token $T_2$ enters via $x_0$, exits via $y_0$, and obtains the value 0 from the counter $A_0$. Finally $T_0$ obtains the value 2 from $A_0$. The behavior is not linearizable because the traversal of the network by $T_1$ completely precedes that of $T_2$, yet $T_2$ returns a lower counter value.
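The scenario above can be replayed step by step. The following is a minimal sketch, assuming the balancer transition and the counter access are modeled as two separate steps so that $T_0$ can be delayed between them; the class name and method names are hypothetical.

```python
class DepthOneNetwork:
    """A single balancer B feeding two counters A_0 and A_1 that count by 2."""

    def __init__(self):
        self.toggle = 0
        self.next_value = [0, 1]  # initial values of A_0 and A_1

    def pass_balancer(self):
        wire = self.toggle
        self.toggle = 1 - self.toggle
        return wire

    def read_counter(self, wire):
        value = self.next_value[wire]
        self.next_value[wire] += 2
        return value


net = DepthOneNetwork()

wire_t0 = net.pass_balancer()         # T_0 exits via y_0 and is then delayed
wire_t1 = net.pass_balancer()         # T_1 exits via y_1
value_t1 = net.read_counter(wire_t1)  # T_1 completes its traversal
wire_t2 = net.pass_balancer()         # T_2 starts only after T_1 completed
value_t2 = net.read_counter(wire_t2)
value_t0 = net.read_counter(wire_t0)  # the delayed T_0 finally reads A_0

print(value_t0, value_t1, value_t2)
```

In this replay $T_1$ completes before $T_2$ begins, yet $T_2$ obtains the smaller value, which is exactly the violation described in the text.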
We use a $c_1/c_2$ timing model in the style of Attiya, Dwork, Lynch, and Stockmeyer [5]. Let $c_1$ be the minimum time that it takes for a token to traverse a wire from balancer to balancer, let $c_2$ be the maximum such time, and assume that balancer transitions are instantaneous. This timing model is general enough to capture standard message passing and shared memory balancer implementations [4, 24]. Alternatively, one could attribute the $c_2/c_1$ latency to the balancer traversal and make wire traversal instantaneous. The two models can be shown to be equivalent, and we choose to attribute delays exclusively to the wires as this simplifies our modeling and presentation.
Our model is also similar to that of semi-synchronous systems (cf. Archimedean distributed systems of Vitanyi [25]). One can view our setting as one in which each token traverses a wire and a balancer on the local clock tick, where the local clocks can tick no faster than every $c_1$ and no slower than every $c_2$ time units according to some global clock.
A common structuring property of almost all published counting networks [2, 4, 3, 9, 12, 15, 17, 18, 23, 24] is uniformity: each balancer of the network lies on some path from inputs to outputs, and all paths from inputs to outputs have equal lengths.
We prove, in Section 3, the following properties for any uniform counting network (explicitly constructible or not):
- If $c_2$
In any quiescent state, $0 \le Y_i - Y_j \le 1$ for any $i < j$.
The step property of counting networks is the cornerstone of the claims and proofs we will present.
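The step property is easy to check mechanically. The sketch below assumes that `exit_counts[i]` holds the number of tokens $Y_i$ that have exited on output $i$ in a quiescent state. The helper `step_distribution` is an addition of ours rather than part of the text: it uses the standard consequence of the step property that, of $m$ tokens, exactly $\lceil (m - i)/w \rceil$ exit on output $i$.

```python
def has_step_property(exit_counts):
    """True if 0 <= Y_i - Y_j <= 1 holds for every pair i < j."""
    w = len(exit_counts)
    return all(
        0 <= exit_counts[i] - exit_counts[j] <= 1
        for i in range(w)
        for j in range(i + 1, w)
    )


def step_distribution(m, w):
    """Per-output token counts for m tokens on width w: ceil((m - i) / w)."""
    return [(m - i + w - 1) // w for i in range(w)]


print(has_step_property([2, 1, 1, 1]))  # a valid step shape
print(has_step_property([1, 2, 1, 1]))  # output 1 is ahead of output 0
print(step_distribution(5, 4))
```

The second call fails the check because a later output wire has seen more tokens than an earlier one, which would correspond to an index being handed out before a smaller index.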
We now add timing to our model. The state transition of a balancer, i.e., the passing of a token from the balancer's input port to its output port, will be modeled as an instantaneous event. While balancer transitions are instantaneous, transitions
Definition 2.1 A counting network is uniform if each balancer of the network lies on some path from inputs to outputs, and all paths from inputs to outputs have equal lengths.
We define the depth of a uniform counting network as the number of wires on the path between any input balancer and output counter. The time $t$ it takes for a token to traverse a uniform network of depth $h$ is bounded by $h c_1 \le t \le h c_2$. It is easy to see, from the above definition, that for each balancer $B$, the lengths of all paths from the input balancers to $B$ are equal and the lengths of all paths from $B$ to the output balancers are equal (see Figure 2). Note that there and in the remaining figures, we do not show the counters attached to the outputs. For $1 \le g \le h + 1$ we also define the $g$-th layer of a network to be the collection of nodes (balancers or counters) whose distance from the inputs is $g - 1$. In the proofs, without loss of generality, we sequentially number the tokens traversing the network according to the time of their entry (ties are broken arbitrarily).
An execution or execution sequence of a network is a sequence $E = e_1, e_2, \ldots$ of instantaneous transition events $e_i = \langle T, B \rangle$ corresponding to a token $T$ traversing a balancer or counter $B$. We associate history variables with tokens and balancers to
Definition 2.3 An execution of a counting network is linearizable if for any two tokens that traverse the network one completely after another (non-overlapping in time), the earlier token obtains a smaller value than the later one.
Definition 2.4 A counting network is linearizable if every execution of the network is linearizable.
We now introduce the notion of non-linearizable operations. Consider an execution in which one network traversal operation completely precedes another traversal operation, but returns a higher value than the later operation. Clearly such an execution is not linearizable. In the definition below we ascribe the non-linearizability of the execution to the later operation:
Definition 2.5 Given an execution of a counting network, we say that a traversal operation and its associated token are non-linearizable if there exists some other traversal operation that completely precedes it in time and whose associated token has a higher returned value.
We choose to define the later operation as the non-linearizable one, and not the earlier operation, since this allows us to determine whether or not an operation is non-linearizable as soon as it completes. Furthermore, if instead the earlier operation were defined to be the non-linearizable traversal operation, this would lead to non-intuitive situations where a single operation can cause all preceding operations to become non-linearizable if it returns a sufficiently low value.
It is easy to see that for any execution sequence, if we remove all non-linearizable traversal operations the remaining sequence of operations will contain no violations of linearizability. However, such a sequence of operations might not correspond to a valid execution of a counting network, since it could contain gaps.
The following definition quantifies non-linearizability of finite executions:
Definition 2.6 The fraction of non-linearizable operations in a finite execution is defined to be the number of non-linearizable operations divided by the number of completed operations in the execution.
It follows from the definitions above that this fraction is an upper bound on the fraction of operations whose removal yields a linearizable execution trace.
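Definitions 2.5 and 2.6 translate directly into a small checker. The sketch below assumes each completed traversal is recorded as a hypothetical `(start, end, value)` triple; the timestamps in the example are invented to reproduce the depth-one scenario from Section 1.

```python
from typing import NamedTuple, Sequence


class Op(NamedTuple):
    start: float  # time the token enters the network
    end: float    # time the token obtains its value
    value: int    # value returned by the traversal


def is_non_linearizable(op: Op, ops: Sequence[Op]) -> bool:
    """Definition 2.5: some operation completely precedes op yet returned more."""
    return any(o.end < op.start and o.value > op.value for o in ops)


def non_linearizable_fraction(ops: Sequence[Op]) -> float:
    """Definition 2.6: non-linearizable operations over completed operations."""
    if not ops:
        return 0.0
    bad = sum(1 for op in ops if is_non_linearizable(op, ops))
    return bad / len(ops)


# Illustrative timestamps: T_0 is delayed, T_1 finishes before T_2 starts.
trace = [
    Op(start=0.0, end=5.0, value=2),  # T_0
    Op(start=1.0, end=2.0, value=1),  # T_1
    Op(start=3.0, end=4.0, value=0),  # T_2
]
print(non_linearizable_fraction(trace))
```

Only $T_2$ is flagged, since blame is assigned to the later operation, so one of the three operations is non-linearizable. The pairwise scan is quadratic in the number of operations, which is adequate for illustration but not for long traces.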
A CHARACTERIZATION OF LINEARIZABILITY FOR COUNTING NETWORKS
In this section and the next, we show that the ratio $c_2/c_1$ plays a key role in determining whether a uniform counting network is linearizable.
We begin by proving several lemmas that will be used to derive our main result, that uniform networks are linearizable for $c_2 \le 2c_1$. The first lemma shows that in any counting network, when a token has completed traversing the network, it has implicit knowledge about the "existence" of a certain minimum number of other tokens.
Proof: The proof is by contradiction. We start by defining the notion of events influencing other events. For a pair of events $e$ and $e'$ in an execution $E$, we say that $e$ influences $e'$ if there is a sequence of events $S = e_1, e_2, \ldots, e_n$ such that (1) $S$ is a subsequence of $E$, (2) $e = e_1$ and $e_n = e'$, and (3) for every $k = 1, \ldots, n - 1$, if $e_k = \langle T_k, B_k \rangle$ and $e_{k+1} = \langle T_{k+1}, B_{k+1} \rangle$, then either $T_k = T_{k+1}$ or $B_k = B_{k+1}$.
We now assume that there exists an execution $E$ in which $T$ is the $a$-th token to exit on $Y_i$, but $|H_T| < w(a - 1) + i + 1$. We fix $E$ and construct a new execution $E'$ in the following way: Let $E'$ be the projection of $E$ consisting of all events involving $T$, and all the events that influence these events. From the definition of implicit knowledge, it is clear that $E'$ contains events involving only the tokens found in $H_T$ during the execution. We claim that $E'$ is a possible execution of the counting network in which the participating tokens and nodes cannot distinguish between $E'$ and $E$.
We show this by induction on all the prefixes of $E'$. The base case for the empty prefix is trivial. For the inductive step we assume that the length of $E'$ is positive and that the prefix of $E'$ of length $n - 1$, for $n \ge 1$, is a possible execution of the network. We now consider the prefix $e'_1, e'_2, \ldots, e'_n$ of $E'$, where $e'_n = \langle S, D \rangle$.
In the next lemma we show that if the tokens in a set $K_1$ enter a network $N$ by time $t$ and proceed according to time schedule $Q_1$, and the tokens in the set $K_2$ enter after $t$, then any tokens that enter after $t$ can only increase the number of tokens that exit on any output of any balancer $B$ as the result of $Q_1$.
Proof: By induction on $g$. For $g = 0$ the lemma follows trivially from the fact that in $S_1$ and $S_2$, by time $t$ only the tokens in $K_1$ enter and they enter through the same input balancers.
Assuming the lemma holds for $g$, we show it holds for $g + 1$. Consider a node $B$ within the layer $g + 2$. Since $N$ is uniform, all of $B$'s inputs are connected to the outputs of some balancers within the layer $g + 1$. By the induction hypothesis, by time $t + g c_2$ the number of tokens that exit on any of these outputs in $S_2$ is no smaller than the number that exit on the same outputs in $S_1$. Since it takes at most $c_2$ time to traverse a wire from one layer to the next, by time $t + (g + 1) c_2$ the number of tokens that enter any of the inputs of $B$ in $S_2$ is no smaller than the number of tokens entering the same inputs in $S_1$.
In any execution, the number of tokens exiting any of the outputs of a balancer is deterministically established from the sum of the number of tokens that enter the inputs of the balancer. Since $Q_1 \subseteq Q_2$, for any balancer, between time $t + g c_2$ and $t + (g + 1) c_2$ there are at least as many tokens transitioning from its inputs to each of its outputs in $S_2$ as in $S_1$.
Suppose additional tokens enter the network after time $t$. Let $S_2$ be the timing schedule that describes an execution with additional tokens entering after time $t$. By Lemma 3.4 with $g = h$, for each output $Y_i$, the new number of tokens that exit in $S_2$ is no smaller than the number that exit in $S_1$, and is therefore at least $q_i^m$. □
The following is our main theorem on the linearizability of uniform counting networks.
Theorem 3.6 If tokens $T_1$ and $T_2$ traverse a uniform counting network of depth $h$ during periods $[t_0, t_1]$ and $[t_2, t_3]$ respectively, in an execution in which $t_1 + h (c_2 - 2c_1) < t_2$, then $T_2$ has a higher returned value than $T_1$.
Proof: Suppose $a_i$ is the number of tokens that exit by time $t_1$ on output $Y_i$ for $0 \le i < w$. We define $r$ as follows: $r = \max\{i : 0 \le i < w \wedge a_i = \max\{a_j : 0 \le j < w\}\}$, that is, $r$ is the largest output index such that $a_r$ is the largest number of tokens that exit on any output.
By Lemma 3.3, there are at least $m = w(a_r - 1) + r + 1$ tokens that enter the network no later than time $t = t_1 - h c_1$ (see Figure 3), and $T_1$ is among these tokens. Let $K$ be the set of these tokens.
By Lemma 3.5, by time $t' = t + h c_2 = t_1 - h c_1 + h c_2$ the tokens in $K$ exit, and for each output $Y_i$ ($0 \le i < w$) the number of tokens that exit is at least $q_i^m$.
From the fact that it takes at least $h c_1$ to traverse the network and because $t_1 + h c_2 - 2 h c_1 < t_2$, token $T_2$ exits at time $t_3 \ge t_2 + h c_1 > t_1 + h c_2 - 2 h c_1 + h c_1 = t_1 + h c_2 - h c_1 = t'$. This means that all tokens that enter by time $t = t_1 - h c_1$ exit before time $t_3$. Thus, all of the tokens in $K$ exit prior to the exit of token $T_2$. Since by time $t_3$ the number of tokens that exit each of the outputs $Y_i$ exceeds the number of tokens $q_i^m$ needed to establish the step property using $m$ tokens, token $T_2$ returns a higher number than any of the tokens in $K$ and therefore higher than $T_1$.
Proof: Given the original network, we attach in front of each of its inputs a path of length $\lceil h (k - 2) \rceil$ of 1-input 1-output "balancers" wired one after the other. The tokens traversing such balancers simply proceed from one to the next.
Proof: Let $h$ be the depth of the tree and let $\varepsilon > 0$ be such that $c_2 = (2 + \varepsilon) c_1$.
We consider an execution in which the first two tokens, $T_0$ and $T_1$, enter the tree at the same time $t_0$ (we visualize the tree on its side with its root to the left and the leaves on the right). Without loss of generality, let $T_0$ go up (corresponding to the root balancer transition from 0 to 1) and $T_1$ go down (the balancer transition from 1 back to 0), i.e., $T_0$ precedes $T_1$. After traversing the root, $T_0$ proceeds at the slowest possible pace of one wire per $c_2$ time, while $T_1$ proceeds at the fastest possible pace of one wire per $c_1$ time. $T_1$ reaches the topmost leaf of the bottom subtree at time $t_1 = t_0 + h c_1$ and returns the value 1 (by the definition of the counting tree and $c_1$).
Immediately after $T_1$'s exit, a wave of $2^h - 1$ tokens enters the tree, say at time $t_2 = t_1 + \delta > t_1$. We choose $\delta$ to be such that $0 < \delta < \varepsilon$. These tokens proceed at the fastest possible pace of one wire per $c_1$ time. Of these tokens, $2^{h-1}$ tokens go to the upper subtree and the remaining $2^{h-1} - 1$ tokens go to the lower subtree.
Since the token $T_0$ is slow, it reaches a leaf at time $t_4 = t_0 + h c_2$. The second wave of fast tokens reaches the leaves at time $t_3 = t_2 + h c_1 = t_1 + \delta + h c_1 = t_0 + 2 h c_1 + \delta = t_0 + h (c_2 - c_1 \varepsilon) + \delta = t_0 + h c_2 - c_1 h \varepsilon + \delta$. Since we chose $\delta$ such that $0 < \delta < \varepsilon$, the inequality can be further simplified to $t_3 < t_0 + h c_2 = t_4$. Thus $t_3 < t_4$ and these fast tokens reach the leaves ahead of $T_0$. Since we have $2^{h-1}$ tokens in addition to $T_0$ traversing the top subtree, at least one token reaches the topmost leaf of the tree and returns the value 0. This token traverses the counting tree completely after $T_1$ exits, but returns a smaller value.
□
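The timeline of this counterexample is simple arithmetic and can be checked numerically. The sketch below uses a hypothetical helper `counting_tree_timeline` and illustrative parameter values. From the expressions for $t_3$ and $t_4$, the fast wave beats $T_0$ exactly when $\delta < h \varepsilon c_1$, which the choice $0 < \delta < \varepsilon$ guarantees whenever $h c_1 \ge 1$.

```python
def counting_tree_timeline(h, c1, eps, delta, t0=0.0):
    """Event times of the counterexample for a tree of depth h."""
    c2 = (2 + eps) * c1
    t1 = t0 + h * c1   # fast token T_1 exits with value 1
    t2 = t1 + delta    # the wave of fast tokens enters
    t3 = t2 + h * c1   # the wave reaches the leaves
    t4 = t0 + h * c2   # slow token T_0 reaches its leaf
    return {"c2": c2, "t1": t1, "t2": t2, "t3": t3, "t4": t4}


# Illustrative values only: depth 3, c_1 = 1, c_2 = 2.5 c_1.
times = counting_tree_timeline(h=3, c1=1.0, eps=0.5, delta=0.25)
print(times)
print("wave overtakes T_0:", times["t3"] < times["t4"])
```

With these values the wave arrives at time 6.25 while $T_0$ arrives at 7.5, so a token that entered after $T_1$ completed takes the value 0 ahead of $T_0$.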
We now consider Bitonic networks.
Proof: By induction on the width $w$ of the network: The base case is trivial for $w = 2$ with a single balancer and two counters (we only need to note that outputs $y_0$ and $y_2$ are the same for this network). Assuming the lemma holds for some width $w \ge 2$, we prove that it holds for networks of width $2w$. The inductive step is depicted in Figure 4, and the balancer and exit labels below refer to that figure. We use the inductive construction of
Proof: In the example in Section 1 we established that a network of width 2 consisting of a single balancer and two counters is not linearizable, and it is easy to see that this is so for any $c_1$ and $c_2$ such that $c_2 > 2 c_1$. Below we consider networks with $w > 2$. We choose $\varepsilon, \delta_1, \delta_2 > 0$ such that $\delta_1 + \delta_2 < \varepsilon$, and we let $c_2 = 2 c_1 + \varepsilon$. Using the framework of Lemma 4.2, we deploy the three tokens $T_0$, $T_1$, and $T_2$ according to the following scenario. Starting in the initial state, we let $T_0$ enter via the input $X_0$, completely traverse the network, and exit via the output $Y_0$, thus returning the value 0. Following this, at some time $t_1$, token $T_1$ also enters via $X_0$, and $T_2$ enters via $X_0$ immediately behind $T_1$ at time $t_1 + \delta_1$ for some $\delta_1 > 0$. We let $T_1$ proceed at the slowest possible pace of one wire per $c_2$ time, while $T_2$ proceeds at the fastest possible pace of one wire per $c_1$ time. This means that $T_1$ exits at time $t'_1 = t_1 + 2 h c_1 + h \varepsilon$, and $T_2$ exits at time $t'_2 = t_1 + \delta_1 + h c_1$. By Lemma 4.2, the paths that $T_1$ and $T_2$ traverse have no balancers in common, with the exception of the first balancer in their paths. Thus, in the execution fragment that follows and does not include these tokens' traversal of the first balancer, $T_1$ is not influenced by $T_2$ and still proceeds to the exit $Y_1$.
As soon as $T_2$ exits via $Y_2$ and obtains the counter value 2, $w$ fast tokens enter the network at time $t_3 = t'_2 + \delta_2$ for some $\delta_2 > 0$. Regardless of these tokens' paths, they exit the network at time $t'_3 = t_3 + h c_1$. Since $\delta_1 + \delta_2 < \varepsilon$, these tokens exit before the slow token $T_1$.
During this execution, the network is traversed by $w + 3$ tokens. If no other tokens enter the network, then each of the outputs $Y_0$, $Y_1$, and $Y_2$ has two tokens that exit through it, and outputs $Y_3, \ldots, Y_{w-1}$ each have one. Thus one of the fast tokens exits via $Y_1$ and, because it is faster than $T_1$, it obtains the counter value 1, while $T_1$ obtains the value $1 + w$. As a result the fast token obtains a lower value than $T_2$.
As we will see in the experimental results in Section 5, when the ratio $c_2/c_1$ increases beyond 2, the percentage of non-linearizable operations also increases. Below we show that for Bitonic networks there can be a large fraction of tokens that exhibit
Proof: The Bitonic counting network [4] of width $w$, Bitonic[$w$], has depth $h = \frac{\log w \, (\log w + 1)}{2}$. The network consists of two stages (see Figure 5). The first stage includes two Bitonic[$w/2$] networks of depth $h_1 = h - \log w$ connected in parallel to the second stage, which is the merging network Merger[$w$] of depth $h_2 = \log w$.
Merger[$w$] consists of a row of balancers connected to two Merger[$w/2$] mergers (for details see [11]). Note that this inductive construction of the merger is different from, but isomorphic to, the construction in Figure 4. The construction we use here yields a clearer proof.
A non-linearizable schedule is constructed as follows: The first wave of $w/2$ tokens enters the Bitonic$_1$[$w/2$] network at the same time and proceeds in lock step at some pace to the exits of the first stage. The second wave of $w/2$ tokens enters the same network immediately behind the first wave after a small delay $\delta > 0$.
As soon as the first wave enters Merger[$w$], it slows down to the slowest possible pace of one wire per $c_2$ time. This wave proceeds to the Merger$_1$[$w/2$] sub-component of the merger after passing through the first row of balancers of Merger[$w$].
Similarly, the second wave of $w/2$ tokens proceeds to Merger$_2$[$w/2$], except that it proceeds at the fastest possible pace of one wire per $c_1$ time. As soon as the second wave exits, a third wave enters Bitonic[$w$] as the first two waves did.
The third wave of $w/2$ tokens proceeds in lock step at the fastest pace of one wire per $c_1$ time to the exits. Therefore this wave exits through the first $w/2$ exits.
It takes the first wave $t_1 > h_2 c_2 = c_2 \log w$ time to reach the exits. It takes the second wave $t_2 = h_2 c_1 = c_1 \log w$ time to exit. It takes the third wave t 3 2
We have shown specific scenarios in which the violations of local timing conditions lead to non-linearizable executions in important classes of uniform counting networks. The work of Mavronicolas et al. [20] shows how violations of timing conditions lead to non-linearizability in general counting networks (see Section 6).
EMPIRICAL EVALUATION OF LINEARIZABILITY
We evaluated the linearizability of counting networks on a simulated distributed-shared-memory machine similar to the MIT Alewife of Agarwal et al. [1]. Alewife is a large-scale multiprocessor that supports cache-coherent distributed shared memory and user-level message-passing. The nodes communicate via messages on a two-dimensional mesh network. A Communication and Memory Management Unit on each node holds the cache tags and implements the memory coherence protocol by synthesizing messages to other nodes. Our experiments make use of the shared memory interface only. To simulate the Alewife we used Proteus, a multiprocessor simulator developed by Brewer, Dellarocas, Colbrook, and Weihl [8]. Proteus simulates parallel code by multiplexing several parallel threads on a single CPU. Each thread runs on its own virtual CPU with accompanying local memory, cache and communications hardware, keeping track of how much time is spent using each component. In order to facilitate fast simulations, Proteus does not do complete hardware simulations. Instead, operations which are local (do not interact with the parallel environment) are run uninterrupted on the simulating machine's CPU and memory. The amount of time used for local calculations is added to the time spent performing (simulated) globally visible operations to derive each thread's notion of the current time. Proteus makes sure a thread can only see global events within the scope of its local time.
Implementation and experimentation methodology
For our benchmarks, we implemented the Diffracting tree [24] and the Bitonic counting network [4] in shared memory. For both types of data structures, each simulated processor was given one of two possible timing characteristics. The first kind allowed the processors to traverse the network unimpeded. The second kind introduced a time delay following the traversal of a balancer. This delay models the network delays or additional work that a processor may need to perform. We randomly designated a fraction $F$ of the processors, all of which were subjected to such delays. We performed two sets of experiments. In one set of experiments, the fraction $F$ was 25%; in the other, $F$ was 50%. For each set of experiments, the time delay is defined via a workload variable $W$ equal to 100, 1000, 10000, and 100000 wait cycles.
We ran the scenarios with the number of processors set to 4, 16, 64, 128, 256, and up to 440 (this upper limit is due to the specifics of the hardware configuration we used). The execution of each simulation proceeded until each processor performed 200 operations. This number was chosen because of the long simulation times for large numbers of processors. (We also performed this test using 5,000 total operations.) The graphs plot the non-linearizability ratio, i.e., the percentage of non-linearizable operations (see Definition 2.6) among all the operations during the execution.
Every balancer was implemented as a critical section protected by a Mellor-Crummey and Scott (MCS) queue-lock [21] and, in the Diffracting tree, using a multi-prism implementation [23]. This was done to reduce contention on the balancers, which would have attenuated the influence of the $W$-waiting periods on the $c_2/c_1$ relation.
The pseudocode for the main component of the simulation, the operation of obtaining the "next" counter value, is given in Figure 6. This code was executed by each simulated process. SharedCounter is the concurrent counter implementation. In our simulations it was either the Bitonic counting network or the Diffracting tree counter implementation. The array TotalIncrements ensured that each processor performed MaxIncrements operations. The private variables old and new were used, respectively, to remember the previous counter value obtained within the process and to store the new value. All other variables are global simulator variables, meaning that all the processes could access them atomically at no cost. Nonlin is the number of non-linearizable operations we observed.
A typical implementation of a shared-memory counter is shown in Figure 7. We present the empirical data by charting the non-linearizability ratio as a function of the number of processors. In each of our experiments, we compute the average time it takes for a processor to traverse a balancer and a wire when the workload $W = 0$. We use this average as the approximation of $c_1$ in the presentation. Note that using such an average is conservative: for example, using the minimum value for such a traversal would cause an increase in the $c_2/c_1$ ratio and thus "excuse" or "explain" more of the non-linearizable operations observed in some scenarios. Using this definition of $c_1$, we compute the normalized $c_2$ as (Average-$c_1$ + Workload) / Average-$c_1$ = 1 + Workload / Average-$c_1$.
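The normalization just described amounts to a one-line computation. The sketch below uses a hypothetical function name, and the average $c_1$ of 400 cycles is an invented figure used only to make the arithmetic concrete; it is not a measurement from our simulations.

```python
def normalized_c2(workload, average_c1):
    """Normalized c_2 = (average c_1 + W) / average c_1 = 1 + W / average c_1."""
    if average_c1 <= 0:
        raise ValueError("average_c1 must be positive")
    return 1 + workload / average_c1


# Hypothetical average c_1 of 400 cycles, for illustration only.
for w_cycles in (0, 100, 1000, 10000):
    ratio = normalized_c2(w_cycles, average_c1=400)
    print(w_cycles, ratio, "above 2" if ratio > 2 else "at most 2")
```

Because the result is already divided by the average $c_1$, it can be compared directly against the threshold of 2 from the main theorem.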
The absolute values of the average $c_1$ vary between the Bitonic network and the Diffracting tree due to the difference in the processing time associated with the prism in the Diffracting tree implementation. For ease of presentation, all data is normalized with respect to the average $c_1$ in the execution. To illustrate the ratio $c_2/c_1$ ($c_2$ divided by $c_1$) we present the normalized $c_2$ and also the normalized standard deviation for $c_1$ in the form Standard-deviation / Average-$c_1$.
Presentation and assessment of empirical data
The main results are presented in Figure 8 for the Diffracting tree and Figure 9 for the Bitonic network. In Tables 1 and 2 we give the normalized $c_2$ for the Diffracting tree and the Bitonic network respectively. In Tables 3 and 4 we give the respective normalized standard deviations for $c_1$.
Using the theoretical results and empirical data we now discuss the effects of timing, network depth, concurrency, and asynchrony and randomization on the linearizability of the simulated execution scenarios.
The effects of timing. As can be seen, for the lower delay workloads ($W = 100$ and $W = 1000$), the normalized $c_2$ is less than or close to 2, and no linearizability violations occur for 16 or more processors. For these workloads some non-linearizability is observed for a small number of processors, i.e., four. Note that for the Bitonic network, the violations occur for these values of $W$ when the normalized $c_2$ is above 5. Even so, the non-linearizability ratio here is less than 1%.
For higher delay workloads ($W = 10000$ and $W = 100000$), the normalized $c_2$ is well above 2 and for the Bitonic network it reaches several hundreds (see Tables 1 and 2). As expected, we observe a significant increase in the ratio of non-linearizable operations. For the Diffracting tree the ratios peak at about 26% for 16 processors, 50% of which incur delays of $W = 100000$. For the Bitonic network the peak ratio is about 12% for the same parameters. Substantially lower peak non-linearizable ratios, of 10% and 5% respectively, are observed for $F = 25\%$ and 16 processors.
It is surprising that despite the high $c_2$, the non-linearizable token ratio falls sharply as the number of processors is increased. We examine some of the reasons for this phenomenon.
The effects of network depth. The Bitonic networks have substantially greater depth than Diffracting trees of the same width. This results in many more operations overlapping in the Bitonic networks given identical token arrival schedules. With this differentiating factor, we expect and indeed observe substantially fewer linearizability violations in the Bitonic network simulations as compared to the Diffracting tree simulations. This padding effect is also suggested by Theorem 3.6, which enables, for a known $c_2/c_1 > 2$, the construction of a linearizable network by extending the depth of any known counting network.
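The amount of padding can be estimated from the quantities in Section 3. The sketch below rests on our reading of the padding construction, in which $k$ denotes a known bound with $c_2 \le k c_1$ (an assumption, since $k$ is not defined in the surrounding text): a prefix of $p$ extra wires delays a token's entry into the original network by at least $p c_1$, so $p c_1 \ge h (c_2 - 2 c_1)$ restores the separation required by Theorem 3.6, which gives $p = \lceil h (k - 2) \rceil$. The function names are hypothetical.

```python
import math


def separation_gap(h, c1, c2):
    """Gap h * (c_2 - 2 c_1) from Theorem 3.6; non-positive when c_2 <= 2 c_1."""
    return h * (c2 - 2 * c1)


def padding_length(h, k):
    """Length ceil(h * (k - 2)) of the 1-input 1-output prefix; 0 if k <= 2."""
    return max(0, math.ceil(h * (k - 2)))


# Width-4 Bitonic network: depth h = log w (log w + 1) / 2 = 3.
print(separation_gap(h=3, c1=1.0, c2=3.5))
print(padding_length(h=3, k=3.5))
print(padding_length(h=3, k=2.0))
```

For a depth-3 network with $c_2 = 3.5 c_1$ the gap is 4.5 time units, so five extra wires per input are needed, more than the depth of the original network. This illustrates how quickly the cost of unconditional linearizability grows with $c_2/c_1$.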
The effects of concurrency. There are simple scenarios that, using as few as two processors, produce high levels of non-linearizability. Recall our example in Section 1, in which three tokens caused one non-linearizable operation. Let processor $P_0$ be the owner of the token $T_0$ and processor $P_1$ be the owner of tokens $T_1$ and $T_2$. If the token $T_0$ is very slow, so that it does not exit the network for a long time, then any sequence of tokens $T_i$ generated by $P_1$ will have each of its even-numbered tokens $T_{2j}$ return lower counter values than its odd-numbered tokens $T_{2j-1}$ for $j > 0$. This is because the even- and odd-numbered tokens traverse the network sequentially. If there were three processors, such that $T_{2j}$ is concurrent with $T_{2j-1}$, then there would be no non-linearizable operations.
Table 4. Standard deviation normalized for average $c_1$ for the simulations of Bitonic networks.
Although far from a complete characterization, the above observation of linearizability versus concurrency provides intuition for why there is a dramatic reduction, at high concurrency, in the number of non-linearizable operations for both the Diffracting tree and the Bitonic network.
Of course the counting network approach is optimized for high concurrency, so it is not surprising that deploying counting networks in low-concurrency settings has its drawbacks. For few processors, there are more efficient and linearizable solutions [14].
The effects of asynchrony and randomization. We also tested the linearizability of our implementation when either all or no tokens were delayed, i.e., the cases of $F = 0\%$ and $F = 100\%$, and/or when the additional delays were eliminated, i.e., $W = 0$. In none of these simulations were there any non-linearizable operations.
Although this is not surprising, since these scenarios create timing schedules close to those of a synchronous implementation, we performed these simulations for completeness.
In another simulation scenario we forced every token to wait a random number of cycles between 0 and $W$. Again, the simulation was observed to be completely linearizable. Randomization apparently has an attenuating effect that prevents consistent accumulation of timing discrepancies by faster or slower tokens.
We have considered local timing characteristics at balancers. The linearizability question can also be posed in terms of global timing characteristics, i.e., in terms of the minimum and maximum time it takes a token to traverse the entire network and without the restriction on the time to traverse each individual balancer. Our examination of Counting trees and Bitonic networks shows that violations of required local conditions lead to non-linearizable executions (this is also shown for general networks in [20]). In these executions we use tokens that traverse a network at the fastest and the slowest possible paces. The fast tokens "bypass" the slow tokens only at the exits. Therefore even if the required conditions are global, our scenarios still yield non-linearizable executions.
There are many other variations of the timing model which one may investigate. However, we feel the most interesting direction to follow at this time is the characterization of applications that do not have an absolute requirement for linearizability, that is, ones requiring that only a given fraction of the operations be linearizable.
ACKNOWLEDGMENTS

