In their SIAM J. on Computing paper [33] from 1992, Martel et al. posed the question of developing a work-optimal deterministic asynchronous algorithm for the fundamental load-balancing and synchronization problem called Certified Write-All. In this problem, introduced in a slightly different form by Kanellakis and Shvartsman in a PODC'89 paper [20], p processors must update n memory cells and only then signal the completion of the updates. It is known that solutions to this problem can be used to simulate synchronous parallel programs on asynchronous systems with worst-case guarantees for the overhead of a simulation. Such simulations are interesting because they may increase productivity in parallel computing, since synchronous parallel programs are easier to reason about than asynchronous ones. This paper presents the first solution to the question of Martel et al. Specifically, we show a deterministic asynchronous algorithm for the Certified Write-All problem. Our algorithm has work complexity O(n + p^4 log n). This work complexity is asymptotically optimal for a nontrivial number of processors p ≤ (n/log n)^{1/4}. In contrast, all known deterministic algorithms require work superlinear in n when p = n^{1/r}, for any fixed r ≥ 1. Our algorithm generalizes the collision principle introduced by Buss et al. [8] in 1991, which had not previously been generalized despite various attempts. Each processor maintains a collection of intervals of {1, 2, . . . , n}. A processor iteratively selects an interval and works from one tip towards the other tip, until it finishes the work or collides with another processor. Collisions are detected effectively using a special Read-Modify-Write operation. In any case, the processor transforms its collection appropriately. Our analysis shows that the transformations preserve some structural properties of collections of intervals. This guarantees that work is assigned to processors in an efficient manner.
Introduction
This paper shows a deterministic algorithm where p asynchronous processors update n cells of shared memory and only then signal the completion of the updates. The algorithm has asymptotically optimal work complexity of O(n) for a nontrivial number of processors p ≤ (n/log n)^{1/4}. This result is the first solution to the question posed by Martel et al. [33] in 1992. Many existing parallel systems are asynchronous. However, writing correct parallel programs on an asynchronous shared memory system is often difficult, for example because of data races, which are difficult to detect in general [7, 38]. When the instructions of a parallel program are written with the intention of being executed on a system that is synchronous, then it is easier for a programmer to write correct programs, because it is easier to reason about synchronous parallel programs than asynchronous ones. Therefore, in order to improve productivity in parallel computing, one could offer programmers the illusion that their programs run on a parallel system that is synchronous, while in fact the programs would be simulated on an asynchronous system.
Simulations of a parallel system that is synchronous on a system that is asynchronous have been studied for over a decade [3, 4, 5, 6, 10, 14, 16, 20, 22, 23, 24, 25, 30, 31, 32, 33, 35, 40, 41]. Simplifying considerably, such simulations assume that there is a system with p asynchronous processors, and the system must simulate a program written for n synchronous processors. The simulations use three main ideas: idempotence, load balancing, and synchronization. Specifically, the execution of the program is divided into a sequence of phases. A phase executes an instruction of each of the n synchronous programs. A phase is divided into two stages. First the n instructions are executed and the results are saved to a scratch memory. Only then are cells of the scratch memory copied back to the desired cells of the main memory. This ensures that the result of the phase is the same even if multiple processors execute the same instruction in a phase, which may happen due to asynchrony. The p processors run a load balancing algorithm to ensure that the n instructions of the phase are executed quickly despite possibly varying speeds of the p processors. In addition, the p processors are synchronized at every stage (twice per phase), so as to ensure that the simulated program proceeds in lock-step. Such a simulation implements the PRAM model [15] on an asynchronous system.
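The two-stage structure of a phase can be illustrated with a minimal sequential sketch. It is not taken from the cited simulations; the function name simulate_phase, the encoding of an instruction as a (target cell, function of main memory) pair, and the schedule argument are all assumptions made for illustration. Repeated entries in the schedule model redundant execution of the same instruction.

```python
def simulate_phase(main, instructions, schedule):
    """Run one simulated phase in two stages (illustrative sketch).

    main         -- list of cell values, the simulated main memory
    instructions -- list of (target_cell, fn) pairs, one per simulated
                    synchronous processor; fn reads main and returns a value
    schedule     -- order in which instruction indices are executed; it must
                    cover every index, and repeats model redundant execution
    """
    scratch = [None] * len(instructions)
    # Stage 1: execute instructions, reading only main, writing only scratch.
    for k in schedule:
        _, fn = instructions[k]
        scratch[k] = fn(main)
    # Stage 2: only after stage 1 is complete, copy the results back to main.
    for k in schedule:
        target, _ = instructions[k]
        main[target] = scratch[k]
    return main


# Two simulated processors swap the contents of cells 0 and 1.
swap = [(0, lambda m: m[1]), (1, lambda m: m[0])]
result = simulate_phase([1, 2], swap, [0, 1, 0, 1, 1])
```

Because stage 1 never modifies main memory, executing an instruction several times yields the same scratch value each time, which is the idempotence property the text relies on.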
One challenge in realizing the simulations is the problem of "late writers", i.e., when a slow processor clobbers the memory of a simulation with a value from an old phase. This problem has been addressed in various ways: by replication of variables [23]; by a combination of hashing, replication, and error correction [4]; by approximate detection of who is late, and replication of variables [5]; by using instructions that execute relatively fast [33]; by versioning of variables using extra atomic primitives [34]; or by restricting the class of computations that can be simulated [33].
Another challenge is the development of efficient load-balancing and synchronization algorithms. This challenge is abstracted as the Certified Write-All (CWA) problem. In this problem, introduced in a slightly different form by Kanellakis and Shvartsman [20] , there are p processors, an array w with n cells and a flag f , all initially 0, and the processors must set the n cells of w to 1, and only then set f to 1. One efficiency criterion for the simulation is to reduce the wasteful use of computing resources. This use can be abstracted as the work complexity (or work for short) that is equal to the worst-case total number of instructions executed by the simulation. A simulation uses an algorithm that solves the CWA problem. Therefore, it is desirable to develop low-work algorithms that solve the CWA problem.
When creating a simulation of a given parallel program for n processors, one may have a choice of the number p of simulating processors. On the one hand, when a CWA algorithm for p ≈ n is used in a simulation, the simulation may be faster than a simulation that uses an algorithm for p ≪ n processors, simply because of higher parallelism: more processors are available to perform the simulation. On the other hand, processors that access shared memory may create hotspots, which may cause delays, and as a result an algorithm for p ≈ n may in fact run slower than an algorithm for p ≪ n (memory contention is disregarded in the model studied in this paper). The actual speed of a simulation may depend on system parameters, and so it is interesting to study CWA algorithms for different relationships between p and n.
The best known randomized algorithm that solves the CWA problem on an asynchronous system was given by Martel and Subramonian [36] . Their algorithm has expected work of O(n) when p ≤ n/ log n, and expected work of O(n log n) when p = n. They also showed a lower bound of Ω(n + p log p) on expected work of Las Vegas CWA algorithms against an oblivious adversary.
Deterministic algorithms that solve the CWA problem on an asynchronous system can be used to create simulations that have bounded worst-case overhead. Thus several deterministic algorithms have been studied [1, 8, 9, 18, 21, 37]. Fixing r ≥ 1, when p = n^{1/r} all these deterministic algorithms have work ω(n). Specifically, when r = 1 the first asynchronous CWA algorithm, called X, was developed by Buss et al. [8]. This algorithm was later generalized by Anderson and Woll. Using a lower bound on contention of permutations (a value related to the number of left-to-right maxima in the permutations, see [1] for a formal definition) due to Lovász [29] and Knuth [27], Malewicz [31] showed that the generalized algorithm, called AWT, has work Ω(n 1+ deterministic algorithms require as much as ω(n) work when p = n^{1/r}, for any fixed r ≥ 1. The processors use O(n + p^4 log n) memory cells for coordinating their work (shown in Theorem 3.16). Our algorithm generalizes the collision principle used by the algorithm T. Namely, each processor has a collection of intervals of w and iteratively selects an interval to work on. The processor proceeds from one tip of the interval towards the other tip. When processors collide, they exchange appropriate information and schedule their future work accordingly. Our algorithm uses a special atomic Read-Modify-Write instruction to detect collisions. Such strong primitives were not used by previous algorithms, except for the algorithm of Groote et al. [18]. Our paper contributes to solving the problems posed by Martel et al. [33] and by Buss et al. [8].
Subsequent work. Subsequent to the conference version of this paper [30], Kowalski and Shvartsman presented a deterministic asynchronous algorithm for the Certified Write-All problem [26]. Their algorithm has asymptotically optimal work when the number of processors is p < n^{1/(2+ε)}. This range is significantly wider than the range of p where the algorithm presented here is proven to have asymptotically optimal work. However, it is not clear that our upper bound on work is tight. It would be interesting to develop tight bounds on work for each algorithm and compare the bounds. The algorithm of Kowalski and Shvartsman uses atomic reads and writes, while our algorithm requires the much stronger RMW primitive. Their algorithm uses a collection of q permutations with contention O(q log q), while it is not known to date how to construct such permutations in polynomial time. Thus their result is so far existential, while ours is explicit.
Paper organization. The remainder of the paper is organized as follows. In Section 2, we define the asynchronous shared memory model of computation used in the paper and the Certified Write-All problem. In Section 3, we present our deterministic algorithm and its analysis. Finally, in Section 4, we conclude with future work.
Model and definitions
We consider a shared memory system where processors can work at arbitrarily varying paces. Our formal definition is based on the Atomic Asynchronous Parallel System as presented by [5] (cf. [2, 11, 12, 13, 17, 28, 33, 39, 42] ).
The system consists of p processors, each of which has a dedicated local memory, and every processor has access to shared memory. Any memory is composed of cells. The initial section of n cells of shared memory stores an array w[0, . . . , n − 1]. The subsequent cell stores a flag f . Any cell of any memory can store any O(log n)-bit number. Any processor has a distinct identifier from {1, . . . , p}.
Each processor has a discrete local clock ranging over N = {1, 2, 3, . . .}. A processor executes exactly one basic action at any tick of the local clock unless the processor has halted. The basic actions that a processor can execute are: a Halt action that stops the operation of the processor, any operation on a constant number of cells from the local memory, and a transfer between the local memory and shared memory. The possible transfers are: reading a single cell of shared memory into a cell of the local memory; writing from a cell of the local memory to a cell of shared memory; and performing a special Read-Modify-Write (RMW) action that compares the value stored at a cell of shared memory with the value stored at a cell of the local memory, and if they are equal, the action transfers a constant number of cells from the local memory to a constant number of cells of shared memory, but in any case returns the result of the comparison (see Figure 1 , and also an example of syntax in Figure 2 ).
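The semantics of the RMW action described above can be sketched as follows. This is only an illustration of the compare-then-conditionally-write behavior, not the code of Figure 1; the class name SharedMemory, the method names, and the use of a lock to stand in for the atomicity that the model provides are assumptions.

```python
import threading


class SharedMemory:
    """Toy shared memory with an atomic Read-Modify-Write (sketch)."""

    def __init__(self, size):
        self.cells = [0] * size
        # The lock stands in for the atomicity guaranteed by the model.
        self._lock = threading.Lock()

    def read(self, addr):
        with self._lock:
            return self.cells[addr]

    def write(self, addr, value):
        with self._lock:
            self.cells[addr] = value

    def rmw(self, cmp_addr, expected, updates):
        """Compare cells[cmp_addr] with expected. If they are equal, write
        the (constant number of) updates, given as {address: value}.
        In any case return the result of the comparison."""
        with self._lock:
            equal = self.cells[cmp_addr] == expected
            if equal:
                for addr, value in updates.items():
                    self.cells[addr] = value
            return equal


mem = SharedMemory(4)
first = mem.rmw(0, 0, {0: 7, 1: 9})   # cell 0 still holds 0, so this succeeds
second = mem.rmw(0, 0, {0: 5})        # cell 0 now holds 7, so this fails
```

The second call leaves memory unchanged, which is the behavior that lets a processor detect that another processor reached a cell first.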
An execution of an algorithm progresses according to the following model of asynchrony. Local time of processor i is mapped to global time through a strictly increasing function T_i : N → R. We assume that no local clock ticks of two processors are mapped to the same instant of global time, i.e., if T_i(x) = T_j(y), then i = j and x = y. A tuple T_1, . . . , T_p with mappings that satisfy these conditions is called a valid tuple of mappings. When a valid tuple T_1, . . . , T_p has been fixed, each processor executes basic actions dictated by its algorithm. The processors take turns according to the total order prescribed by the mappings. A processor does not execute basic actions after the tick at which it executed the Halt action, if it executed that action. The execution of any basic action is instantaneous, and so the resulting memory updates are atomic.
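The total order induced by a tuple of mappings can be made concrete with a small sketch. The function name interleave, the dictionary encoding of the mappings, and the restriction to a finite prefix of ticks are assumptions made for illustration.

```python
def interleave(mappings, ticks):
    """Order local clock ticks by global time (illustrative sketch).

    mappings -- dict: processor id -> function from local tick to global time
    ticks    -- number of local ticks of each processor to consider
    Returns a list of (global_time, processor, local_tick) in global order.
    """
    events = []
    for i, T in mappings.items():
        previous = None
        for x in range(1, ticks + 1):
            t = T(x)
            if previous is not None and t <= previous:
                raise ValueError("mapping is not strictly increasing")
            previous = t
            events.append((t, i, x))
    times = [t for t, _, _ in events]
    if len(set(times)) != len(times):
        raise ValueError("two ticks share the same global instant")
    return sorted(events)


# Processor 1 ticks at global times 2, 4, 6; processor 2 at 1.5, 2.5, 3.5.
order = interleave({1: lambda x: 2.0 * x, 2: lambda x: x + 0.5}, 3)
turns = [proc for _, proc, _ in order]
```

Here the processors take turns in the order 2, 1, 2, 2, 1, 1, so processor 2 performs all three of its actions before processor 1 performs its second.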
We adopt the following definition of the Certified Write-All (CWA) problem: given the array w[0, . . . , n − 1] with n cells and the flag f, all initially 0, set the n cells of w to 1, and only then set f to 1. An algorithm solves the CWA problem for p processors and n cells, if for any valid tuple T_1, . . . , T_p of mappings, the following three conditions hold:
(i) (Termination) each processor halts after a finite number of local clock ticks,
(ii) (Certification) when any processor halts, the flag f has been set to 1,
(iii) (Validity) when the flag f is set to 1, all cells of w have been set to 1.
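The three conditions can be checked mechanically on a finite execution trace. The sketch below is an illustration only; the event encoding ("write_w", x), ("set_f",) and ("halt", i), listed in global-time order, is a hypothetical format introduced here.

```python
def check_cwa_trace(n, p, trace):
    """Check conditions (i)-(iii) on a finite trace (illustrative sketch).

    trace -- list of events in global-time order, each one of
             ("write_w", x), ("set_f",), or ("halt", i) with 1 <= i <= p
    """
    w = [0] * n
    f = 0
    halted = set()
    for event in trace:
        kind = event[0]
        if kind == "write_w":
            w[event[1]] = 1
        elif kind == "set_f":
            if not all(w):
                return "validity violated"
            f = 1
        elif kind == "halt":
            if f != 1:
                return "certification violated"
            halted.add(event[1])
    if halted != set(range(1, p + 1)):
        return "termination violated"
    return "ok"


good = [("write_w", 0), ("write_w", 1), ("set_f",), ("halt", 1), ("halt", 2)]
verdict = check_cwa_trace(2, 2, good)
```

A trace in which a processor sets f after writing only cell 0 would be reported as violating validity.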
The work complexity of a deterministic algorithm that solves the CWA problem for p processors and n cells measures the maximum total number of basic actions executed by the processors. Consider any valid tuple T_1, . . . , T_p of mappings. Let h_i be the first local clock tick when processor i executes the Halt action, or ∞ if it does not execute the action. Then the total number of basic actions executed by the processors is Σ_{i=1}^{p} h_i, and the work complexity is the maximum of this sum over all valid tuples of mappings.

Note that in this model, there is a trivial Write-All algorithm for n = p where the first basic action that a processor i, 0 ≤ i ≤ n − 1, executes is an assignment of 1 to cell i of the array w (because the model ensures that each processor will eventually perform a basic action). This takes O(n) work in total. However, in general, no processor can certify and halt right after performing its first basic action without violating the validity condition. The processor simply cannot always ensure that each of the n cells has been set to 1, due to the fact that other processors may be delayed.
Collision algorithm and its analysis
This section presents a deterministic algorithm for the Certified Write-All problem with asynchronous processors (see Figure 2). The algorithm generalizes the collision principle of algorithm T. The main algorithmic approaches of our algorithm are: to ensure that any processor often works on a relatively large interval of unset cells of the array w, according to a sequence that enables rapid detection of two processors setting the same cell of the array to 1; and, when redundancy occurs, to ensure an effective mechanism for reassigning work to processors. Briefly speaking, all processors share an array tab with n cells used for coordination of their work. Each processor maintains a collection of intervals of the set {0, 1, . . . , n − 1} (an interval is a subset of consecutive elements of the set). A processor takes an interval from the collection and keeps setting cell w[x] to 1 and storing some special information in cell tab[x], while working through cells x from a tip of this interval towards the opposite tip. Later, the processor removes some intervals or their parts from the collection, possibly based on information obtained from other processors. This process is repeated as long as there is an interval in the collection. When the collection becomes empty, the processor sets the flag f to 1 and halts.
There are several challenges that we solve to ensure that our algorithm avoids doing too much redundant work. It may happen that two processors "collide" at the same cell while working in opposite or the same directions. When they work in opposite directions, it could happen that they "cross" each other and duplicate the work that the other processor already did. When they work in the same direction, they may keep on working "side-by-side" and again duplicate the work that the other processor is doing. Another potential problem is that even if we are able to detect collisions, a processor that collides must decide upon a cell of the array from which it will resume its work. Ideally, the processor should choose to work from a tip of an interval so that this tip is "far away" from any cell that any other processor is currently working on. This is desirable because it would help to ensure that when the next collision of this processor occurs, a substantial number of distinct new cells of the array w have been set to 1.
Intuitively, our algorithm solves these challenges as follows. The processors coordinate their work on intervals using atomic Read-Modify-Write (RMW) instructions. This ensures that once a processor performs a successful RMW on a cell, no other processor can succeed on that cell. As a result, a colliding processor sets at most one cell of w to 1 before it detects a collision with another processor, and has an opportunity to reassign its own future work. The choice of a relatively long interval located in a rather unassigned part of the array is intuitively done by a processor always working on an interval that is at least half as long as a longest interval in the collection. In addition, we ensure that a colliding processor obtains knowledge from the other processor about which cells of w remain to be set to 1, and this allows us to guarantee that when a processor often collides, it must substantially reduce the amount of work that it "thinks" remains to be done, even though it has not actually recently set to 1 any distinct cells of w.
The following sections present details of this intuitive explanation. We begin, in Section 3.1, with an overview of the collision algorithm to help the reader gain familiarity with the code of the algorithm and how the algorithm works. In the algorithm, processors maintain collections of intervals and they sometimes incorporate knowledge from collections of other processors. Then, in Section 3.2, we detail the operations on collections of intervals that processors use. The operations incorporate knowledge so that the resulting collections of intervals are "well-behaved" in a precise sense. Finally, in Section 3.3, this good behavior of operations helps us prove a bound on work for our algorithm.
Overview of the collision algorithm
We outline the collision algorithm. We begin with the shared and the local memory variables that processors access, and then describe how the algorithm works. Recall that n denotes the number of cells in the Write-All array and p denotes the number of processors. We assume that n and p are powers of two. Line numbers mentioned in this section refer to the code in Figure 2 .
Each processor has access to three shared variables: the completion flag f, the Write-All array w, and the trace table tab. Each cell of tab has three fields, dir, U, and D (see Figure 3, part a). The first field is a bit equal to either L or R. The second field is a collection of intervals of {0, . . . , n − 1}. As we will see later, at most p/2 intervals are ever stored in this field. The third field is an interval of {0, . . . , n − 1}. Note that each interval can be represented as two numbers, the interval tips. Hence, any cell of tab contains O(p log n) bits. We will later see how to use a pointer representation of collections, so as to ensure that each cell of the trace table stores only O(log n) bits. For now, we use the "expanded" representation for simplicity of exposition. We assume that when processors begin execution, these shared variables are initialized as follows: f and cells of w to zero, and every cell of tab to (L, ∅, ∅) (i.e., tab[x].dir = L, tab[x].U = ∅, and tab[x].D = ∅, for any 0 ≤ x ≤ n − 1).
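The "expanded" representation of a trace table cell can be sketched as a small record. The field names dir, U and D follow the text; the class name TraceCell, the encoding of an interval as a pair of inclusive tips, and the use of None for the empty interval are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hypothetical encoding: an interval is a pair (lo, hi) of inclusive tips.
Interval = Tuple[int, int]


@dataclass
class TraceCell:
    dir: str = "L"                          # direction bit, "L" or "R"
    U: FrozenSet[Interval] = frozenset()    # collection of intervals
    D: Optional[Interval] = None            # None encodes the empty set


def make_trace_table(n):
    """Return a trace table whose n cells all hold (L, empty, empty)."""
    return [TraceCell() for _ in range(n)]


tab = make_trace_table(8)
```

With at most p/2 intervals in U and two tips per interval, such a cell holds O(p) numbers of O(log n) bits each, which matches the O(p log n) bound stated above.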
Each processor has a few local variables. The variable U contains a collection of intervals of {0, . . . , n − 1}. As we will see later, U can contain at most p/2 intervals at any time during execution. These intervals cover all cells that remain to be written to, and so any cell that is not in one of the intervals must necessarily have already been written to. However, some cells covered by intervals from U may already be written to, because of the concurrent work of other processors, and possibly delayed dissemination of knowledge. Each tip of any interval in U has a flag equal to marked or unmarked. The processor has other local variables: U′ is a collection of intervals of {0, . . . , n − 1} (again it will contain at most p/2 intervals); D and D′ are intervals of {0, . . . , n − 1}; c, s, e, x are integers from {0, . . . , n − 1}; dir, dir′ and failed are bits.
Note that the algorithm is uniform (i.e., it is the same for each processor), so we describe it for a given processor i. The processor begins by setting its collection U to a set that contains just one interval [0, n − 1] (line 01). The tips of this interval are unmarked. Note that then U contains an interval with an unmarked tip. Next the processor enters a big while loop (lines 02 to 29). In general, every time the processor starts an iteration of the while loop (line 03), the following loop invariant holds: the collection U contains at least one interval with an unmarked tip; the other tip and some tips of other intervals, if any, may be marked. The body of the while loop has several sections of code each carrying out a specific function.
The processor selects an interval from U (lines 03 and 04). The interval is chosen so that it has an unmarked tip s, and e is the other tip of the interval. The processor will attempt to work on the interval from s to e. The relationship between s and e determines the direction of work: either L, meaning left, or R, right.
Then the processor works on the interval by iterating through cells x from tip s to tip e (lines 05 to 10). At the end of each iteration, the interval D contains the cells through which the processor has iterated so far. At every iteration of the loop, the processor writes to the cell w[x] of the Write-All array, and "leaves a trace" of its work inside a cell tab[x] of the trace table (see animation in Figure 3, part a). Writing to cell tab[x] is done using the Read-Modify-Write atomic operation, so as to ensure that no two processors succeed in writing to the same cell of tab (lines 30 to 35). Specifically, the write tests the value of the third field of the cell, and only if it is equal to the empty set, the write proceeds by setting the field to a value of D ∪ {x} that is always different from the empty set. The interval D ∪ {x} contains all cells on which the processor has successfully performed an RMW during the work on the interval. In addition to modifying the third field, a successfully performed RMW stores in tab[x] the direction dir of work and the collection U. Two events can happen during the time when the processor iterates from s to e: 1) either the processor reaches the tip e of the interval because it has successfully performed an RMW on all cells of the interval, or 2) the processor fails on an RMW operation, because earlier some processor j successfully performed an RMW on a cell of the interval; if this happens we say that processor i collides with processor j. The actions of processor i depend on whether i collided or not.
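The work on one interval can be sketched sequentially as follows. This is an illustration of the description above, not the code of Figure 2. The names try_claim and work_on_interval, the dictionary encoding of trace cells, the encoding of intervals as pairs of inclusive tips, and the choice of direction R when s = e are assumptions; the RMW is emulated by an ordinary function, which is atomic here only because the sketch runs in a single thread.

```python
def try_claim(tab, x, direction, U, D_new):
    """Emulated RMW on tab[x]: succeed only if the third field is empty."""
    if tab[x]["D"] is not None:
        return False
    tab[x] = {"dir": direction, "U": U, "D": D_new}
    return True


def work_on_interval(w, tab, U, s, e):
    """Iterate from tip s to tip e. Return (D, collision_cell), where
    collision_cell is None if tip e was reached without a collision."""
    direction = "R" if s <= e else "L"   # assumption: R when s == e
    step = 1 if direction == "R" else -1
    D = None                             # cells claimed so far (empty)
    x = s
    while True:
        w[x] = 1                         # write to the Write-All array first
        D_new = (min(s, x), max(s, x))   # the interval D united with {x}
        if not try_claim(tab, x, direction, U, D_new):
            return D, x                  # failed RMW: collision at cell x
        D = D_new
        if x == e:
            return D, None
        x += step


n = 8
w = [0] * n
tab = [{"dir": "L", "U": frozenset(), "D": None} for _ in range(n)]
U = frozenset({(0, n - 1)})
done_i = work_on_interval(w, tab, U, 0, 3)   # works rightwards on cells 0..3
done_j = work_on_interval(w, tab, U, 7, 0)   # works leftwards, collides at 3
```

In this run the second call claims cells 7 down to 4 and then fails on cell 3, where it can read the direction R and the interval (0, 3) left behind by the first call.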
When processor i has not collided, it records recent progress (lines 11 and 12). Specifically, then the interval D is equal to the interval from U with tips s and e on which the processor has just finished working. The processor removes this interval from U (see Figure 3 , part b).
Slightly more complicated operations are performed when processor i has collided. In such a case, processor i will incorporate the knowledge gained from the processor j with whom the collision has occurred. The incorporation of knowledge proceeds in several stages.
First, processor i retrieves information about processor j from the trace table (lines 13 and 14). Recall that a collision occurs when a processor fails on an RMW to a cell tab[x] of the trace table. This means that a processor j has already performed a successful RMW to the cell. This cell must contain the collection U of intervals that processor j had at the time when it performed the successful RMW to the cell tab[x]. The cell also contains the direction dir in which processor j was working at that time, and the part D of the interval that processor j had successfully worked on until the time it performed the RMW. So processor i can retrieve these three pieces of information from the trace table by reading tab[x] (see Figure 3, part c). We denote the pieces by U′, dir′ and D′. Note that the local variables U, dir and D of processor j at the time when i retrieves information from the cell tab[x] may be different from U′, dir′ and D′, because j might have executed many instructions since it performed the RMW to tab[x].
Second, processor i intersects the two collections of intervals, its own U with the collection U′ of processor j (lines 15 and 17). A collection with the longest interval is taken, and its intervals are trimmed by the union of the intervals from the other collection. A detailed specification of the intersection operation is given in Lemma 3.3. As a result of the intersection, U contains some intervals. Then the actions that processor i takes depend on whether the colliding processors i and j worked in the same, or opposite directions.

[Figure 2 (code of the collision algorithm), section comments: work from tip s to tip e of the interval; head-on-head collision, record combined progress; head-on-back collision, record progress and mark a tip.]
If they worked in opposite directions, then processor i records progress D and D′ (lines 18 and 19). It removes from U all intervals, or their parts, that are contained in either the interval D that i successfully worked on, or the interval D′ that j successfully worked on (see Figure 4, part a). A detailed specification of the removal operation is given in Lemma 3.5.
If, on the other hand, processors i and j worked in the same direction, then progress D and D′ is recorded in a different way (lines 20 to 23). Two kinds of "head-on-back" collision are distinguished. The first kind is when processor i failed on its first RMW in the work on the interval from s to e, and the second kind is when processor i succeeded on the first RMW and so "bumped into the back" of the trace of processor j later during the work on the interval. When processor i succeeded on the first RMW, then D ≠ ∅ (line 21). Here processor j cannot have arrived at x "from behind" processor i, and so j must have just started working on its interval when it performed the RMW to cell tab[x] (see Figure 4, part b). Then processor i removes from U the interval D that it has just successfully worked on (line 21). A detailed specification of the removal operation is given in Lemma 3.7. As a result of the removal, cell x becomes a tip of an interval in U. When, on the other hand, processor i failed on the first RMW, then D = ∅ (line 22). Here processor j must have successfully worked on at least one cell, but possibly more (see Figure 4, part c). Processor i removes from U the interval D′ that j worked on, except for the cell x where the collision took place (line 22). A detailed specification of the removal operation is given in Lemma 3.9. No matter which of the two kinds of "head-on-back" collisions occurred, as a result x is a tip of an interval in U, and processor i marks the tip (line 23). Marking allows us to keep track of the tips where head-on-back collisions took place. A marked cell is known to have been set to 1.
In any case, whether processor i collided or not, the processor then checks if all cells have been written to (lines 24 and 25). As the algorithm iterates, intervals of length 1 with a marked tip may emerge in the collection U . Such intervals are removed from U , as we are certain that the corresponding cells must have been written to. If U does not contain any more intervals, then the processor certifies and halts. As a result of the removal, any interval with a marked tip has length at least 2.
The final action performed by processor i in the body of the while loop is a possible partition of an interval (lines 26 to 29). If all tips of all intervals in U are marked, then the processor takes a longest interval in U , and partitions it into two halves. The two tips that are being exposed are unmarked. This ensures that there is at least one interval in U with an unmarked tip. So the while loop invariant holds again.
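The last two steps of an iteration, the removal of length-1 intervals with a marked tip and the possible halving of a longest interval, can be sketched together. This is an illustration under assumed encodings, not the code of Figure 2: an interval is a tuple (lo, hi, lo_marked, hi_marked) with inclusive tips, and the function name end_of_iteration is hypothetical.

```python
def end_of_iteration(U):
    """Prune and, if needed, partition (illustrative sketch).

    U -- list of intervals (lo, hi, lo_marked, hi_marked), inclusive tips
    Returns (new_U, done), where done means the processor may certify.
    """
    # Drop intervals of length 1 with a marked tip: the cell is known set.
    U = [iv for iv in U if not (iv[0] == iv[1] and (iv[2] or iv[3]))]
    if not U:
        return U, True
    # If every tip of every interval is marked, halve a longest interval;
    # the two newly exposed tips are unmarked.
    if all(iv[2] and iv[3] for iv in U):
        longest = max(U, key=lambda iv: iv[1] - iv[0])
        U.remove(longest)
        lo, hi, lo_marked, hi_marked = longest
        mid = (lo + hi) // 2
        U.append((lo, mid, lo_marked, False))
        U.append((mid + 1, hi, False, hi_marked))
    return U, False


split = end_of_iteration([(0, 3, True, True), (6, 6, True, False)])
finished = end_of_iteration([(5, 5, False, True)])
```

In the first call the singleton at cell 6 is dropped and the fully marked interval over cells 0 to 3 is halved, so an unmarked tip exists again; in the second call the collection becomes empty and the processor may certify.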
Collections of intervals, their transformations, and preserved properties
During the time when a processor executes the collision algorithm, the processor interacts with other processors through the trace table. As a result, the processor transforms its collection U in various ways. Lines 11 through 29 of the algorithm list the conditions under which the transformations are performed. The current section is devoted to defining these transformations and demonstrating their effect on U. We first introduce the notions of "regularity" and "monotonicity" of collections of intervals. We then show that the transformations maintain regularity and monotonicity of U, and that the number of elements contained in the intervals of U gets reduced rapidly under certain conditions. These observations will be useful when reasoning about correctness and work of the algorithm.
We define regularity and monotonicity of collections of intervals (see Figure 5 for illustration). All intervals in this and subsequent sections are over the set of integers {0, 1, . . . , n − 1}.

Figure 5: This figure illustrates the definition of regular and monotonic collections of intervals. In the example above we have three collections U_1, U_2, and U_3 of intervals of cells. These collections are regular for the following reasons. Any collection has disjoint intervals whose lengths are powers of 2, and the lengths of intervals in any collection differ by at most a factor of 2 (they do not differ for U_1, but differ for U_2 and for U_3). Any two intervals from any two distinct collections are either disjoint, or one is contained in the other, i.e., there are no partial overlaps. If one is contained in the other, then the subset must be properly aligned inside the superset, i.e., the superset can be partitioned into intervals of the same length, the number of these intervals is a power of two, and the subset is one of the intervals (e.g., I_{2,1} can be partitioned into 2 intervals, the right of which is I_{3,1}). Note that collections U_1, U_2 are monotonic, because for each interval of U_2 there is a superset interval in U_1. On the other hand, collections U_2, U_3 are not monotonic, because there is no interval in U_2 that contains interval I_{3,2}.
(iv) for any j, 0 ≤ j < g, if interval I belongs to U_{j+1} then there is an interval J that belongs to U_j such that J ⊇ I.
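Property (iv) and some of the conditions listed in the caption of Figure 5 can be checked with a short sketch. It is illustrative only and does not cover the alignment condition; the function names and the encoding of an interval as a pair (lo, hi) of inclusive tips are assumptions.

```python
def length(iv):
    return iv[1] - iv[0] + 1


def is_power_of_two(m):
    return m >= 1 and m & (m - 1) == 0


def lengths_ok(collection):
    """Lengths are powers of 2 and differ by at most a factor of 2."""
    lens = [length(iv) for iv in collection]
    return all(is_power_of_two(m) for m in lens) and (
        not lens or max(lens) <= 2 * min(lens))


def no_partial_overlap(a, b):
    """Two intervals are either disjoint or one contains the other."""
    disjoint = a[1] < b[0] or b[1] < a[0]
    nested = (a[0] <= b[0] and b[1] <= a[1]) or (b[0] <= a[0] and a[1] <= b[1])
    return disjoint or nested


def is_monotonic(collections):
    """Property (iv): each interval of a collection has a superset
    interval in the preceding collection."""
    for prev, nxt in zip(collections, collections[1:]):
        for iv in nxt:
            if not any(J[0] <= iv[0] and iv[1] <= J[1] for J in prev):
                return False
    return True


U1 = [(0, 7), (8, 15)]
U2 = [(0, 3), (8, 15)]
U3 = [(0, 3), (4, 5)]
```

With these example collections, chosen here and not taken from Figure 5, the pair U1, U2 is monotonic, while U2, U3 is not, because no interval of U2 contains (4, 5).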
We now show a simple fact that halving any interval in the last collection preserves monotonicity. Taking a specific intersection of two collections preserves regularity and monotonicity, as shown below. In this intersection, a collection with the longest interval is taken, and its intervals are trimmed by the union of the intervals from the other collection.

Proof. We shall study the content of V based on the relationship between the length of a longest interval in U and the length of a longest interval in U′. In preparation for the case analysis we record what the lengths of intervals in these collections may be. Since collections are regular, by property (i), we know that the length of a longest interval in U is 2^k and the length of a longest interval in U′ is 2^{k′}, for some integers k, k′ ≥ 0. We also know that intervals in U have length 2^k or 2^{k−1}, and the intervals in U′ have length 2^{k′} or 2^{k′−1}.
Then the number of intervals in
For the first case suppose that k > k′, and let us investigate common parts between U and U′. Let I be an interval from U, and J from U′. The interval I cannot be shorter than 2^{k−1}, and the interval J cannot be longer than 2^{k−1}. Therefore, J is too short to be a strict superset of I. Hence, by property (ii), any interval J from U′ is either a subset of some interval from U or does not intersect with any interval from U. Consequently, the set U ∩ U′ contains only some complete nonempty intervals from U′. The length of a longest interval in U ∩ U′ is reduced by a factor of 2 or more compared to the length of a longest interval in U. Since removing an interval from a collection does not invalidate the properties, the collections V,U 0 1 , . . . , U i , V are monotonic.

The second case is when k < k′. We can carry out an analysis similar to the one presented in the previous paragraph. The set U ∩ U′ contains only some complete nonempty intervals from U, and regularity and monotonicity hold. However, we do not guarantee that the length of a longest interval in U is reduced, because it could happen that any interval from U is included in an interval from U′.
The final case is when k = k'. Let us again investigate the result of the operation U ∩ U'. Take any interval I from U, and let us see what part of this interval is contained in U ∩ U', if any. By property (ii), for any interval J from U' we have: either I ⊆ J, or I ⊃ J, or I ∩ J = ∅. If the first subcase occurs, then we are guaranteed that the complete I is contained in U ∩ U'. Suppose that the first subcase does not happen, and so for all J either I ⊃ J or I ∩ J = ∅ (but never I ⊆ J). The result now depends on how many distinct J there are that satisfy I ⊃ J. Suppose that I ⊃ J. Our assumption about the length of intervals ensures that the length of such J is 2^{k−1} and of I is 2^k. By property (i), we can have either zero, or one, or two intervals in U' that are strict subsets of I. In the first situation all intervals J have empty intersection with I, and so I is not contained in U ∩ U'. In the last situation the two intervals combined must yield I, and so the complete I is contained in U ∩ U'. The discussion presented in this paragraph so far implies that regularity and monotonicity trivially hold because the intervals contained in U ∩ U' are complete intervals from U. In the middle situation, by property (iii), the interval I is partitioned into two halves: J and I \ J, and so only the half J is contained in U ∩ U', but not the other half. An argument similar to that in Lemma 3.1 shows that monotonicity is preserved, and an argument similar to that in Lemma 3.2 shows that regularity is preserved.
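The trimming used in this proof can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the paper's code: intervals are modeled as inclusive (lo, hi) pairs of cell indices, and the helper names covered, trim_by_union, longest, and intersect are hypothetical. The sketch takes the collection with the longest interval and keeps, for each of its intervals, the maximal runs of cells that are also covered by the other collection.

```python
def covered(collection):
    """Return the set of all cells covered by a collection of (lo, hi) intervals."""
    cells = set()
    for lo, hi in collection:
        cells.update(range(lo, hi + 1))
    return cells


def trim_by_union(base, other):
    """Trim every interval of `base` by the union of the intervals of `other`.

    For each interval of `base`, keep the maximal runs of consecutive cells
    that are also covered by `other`.
    """
    keep = covered(other)
    result = []
    for lo, hi in base:
        run_start = None
        for x in range(lo, hi + 2):  # hi + 1 is a sentinel that closes an open run
            inside = x <= hi and x in keep
            if inside and run_start is None:
                run_start = x
            elif not inside and run_start is not None:
                result.append((run_start, x - 1))
                run_start = None
    return result


def longest(collection):
    """Length of a longest interval in the collection (0 if it is empty)."""
    return max((hi - lo + 1 for lo, hi in collection), default=0)


def intersect(u, u_prime):
    """Take the collection with the longest interval and trim it by the other one."""
    if longest(u) >= longest(u_prime):
        return trim_by_union(u, u_prime)
    return trim_by_union(u_prime, u)


# The "middle situation" of the case k = k': only one half of (0, 7) survives.
print(intersect([(0, 7), (8, 15)], [(0, 3), (8, 15)]))
```

In the three situations of the last case, the sketch returns nothing for an interval with no sub-interval in the other collection, one half when exactly one half is present, and the whole interval when both halves are present.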
The operation of intersection defined in the preceding lemma yields a certain reduction of size or length of the intervals in the resulting collection V, compared to the given collection U.

Proof. To prove the lemma, we take any interval K from V and argue about what the result of subtracting D ∪ D' from K is. We arrange the argument in three cases by the relationship between the length of a longest interval in U and in U'.
For the first case suppose that k > k'. Then, by Lemma 3.3, the collection V contains only some complete intervals from U'. Inspecting the possible lengths of intervals from U and U' reveals that it cannot happen that an interval from U is a strict subset of some interval from U', and so, by property (ii), J ⊆ I (recall that we assume that I ∩ J ≠ ∅). W is equal to V except for possibly some complete intervals removed, and so the desired regularity and monotonicity hold.
A symmetric case is when k < k'. Now collection V contains only some complete intervals from U, and it must be that I ⊆ J, and that for any
is either empty or equal to K, and so desired regularity and monotonicity hold.
Finally, consider the case when k = k'. Since I ∩ J ≠ ∅, by property (ii), we have three subcases: J ⊂ I, I ⊂ J, and I = J. We consider them in turn. For the first subcase suppose that J ⊂ I. Since, by property (i), the length of I and J can be either 2^k or 2^{k−1}, the length of I is 2^k and, by property (iii)
is either K or an empty set. Thus in the first subcase the set K \ (D ∪ D') has length either 2^k or 2^{k−1} or 0 and is equal to K, or a half of K, or the empty set. Hence the desired regularity and monotonicity hold.

For the second subcase suppose now that I ⊂ J. Then J has length 2^k, I is its half of length 2^{k−1}, and D ∪ D' is equal to J or I. Take any K from V. Since K is an interval from U or its half and has length 2^k or 2^{k−1}, K is too short to be a strict superset of J, and, by property (ii),
Thus in the second subcase the set K \ (D ∪ D') has length either 2^k or 2^{k−1} or 0. Hence the desired regularity and monotonicity hold. Finally, consider the last subcase when I = J. Then the set K \ (D ∪ D') is either empty or equal to K, and so it has length either 2^k, or 2^{k−1}, or 0, and so the desired regularity and monotonicity hold.
Corollary 3.6. If U = V , then the total number of elements in the intervals of W is at most the total number of elements in the intervals of U minus half of the length of a longest interval in U , i.e., |∪ I∈W I| ≤ |∪
Proof. If U = V then the I used in the statement of Lemma 3.5 belongs to V. Consequently, one of the sets K from the statement of Lemma 3.5 is equal to I. We now follow the last three paragraphs of the proof of Lemma 3.5 to see what the difference is between V and W. It cannot be that k > k' because then U ≠ V. If k < k' then the set D ∪ D' used in the statement of Lemma 3.5 contains I, and so the collection W does not contain the interval I, which has length at least half of the length of the longest interval in U. If k = k' then: when J ⊂ I then I has the length of a longest interval in U and the set D ∪ D' is at least a half of I, which is removed; when I ⊂ J then D ∪ D' contains I, and so I is removed; when I = J then I is removed.

Proof. We begin by showing that there are just two cases to consider. Since I ∩ J ≠ ∅, by property (ii), we know that either I ⊆ J or J ⊂ I. Suppose that I ⊆ J. Then, when D is a prefix of J, x must be the smallest element in J, and so x − 1 does not belong to J and so cannot belong to I either, while we know that x belongs to I, or when D is a suffix of J, x is the largest element in J and so x + 1 does not belong to J and so cannot belong to I either. Hence the assumption that I ⊆ J leads to a contradiction. Consequently it must be that J ⊂ I, and so either k = k' or k > k'. For the first case suppose that k > k'. Then V contains only some complete intervals from U'. Take any K from V. By property (ii), we have one of the three subcases:
Then the number of intervals in Q is at most the number of intervals in
The interval K is too short to be a strict superset of I, so the middle subcase cannot happen. If the last subcase happens, then K \ D = K. Let us now focus on the first subcase when K ⊆ I.
Notice that the sets D, J, and I \ (D ∪ J) are disjoint intervals and their union is I. By property (i), either
while when K does not intersect with J, then K \ D is either empty or equal to K. Consequently the collection Q contains only some complete intervals from V , and so desired regularity and monotonicity hold.
Finally, consider the second case when k = k'. Take any K from V. The collection V contains some complete intervals from U or halves of some other intervals from U, but never two halves of any interval from U. Hence, by property (i), either K ∩ I = ∅, or K = I, or K is a half of I, but then there is no other interval in V that is equal to the other half of I. In the first subcase K \ D = K, and so let us focus on the two remaining subcases. Note that the length of J can be either 2^k or 2^{k−1}, but since J ⊂ I the length of J must be 2^{k−1}, and J must be a half of I. As a result, D and J are the two halves of I. So if K = I then K \ D is equal to J, a half of K. If K is a half of I, then K \ D is either K or empty. Again regularity and monotonicity hold for Q.
Corollary 3.8. If U = V , then the total number of elements in the intervals of Q is at most the total number of elements in the intervals of U minus half of the length of a longest interval in U , i.e., |∪ I∈Q I| ≤ |∪
Proof. As explained in Corollary 3.6, we have that k ≤ k' and I is in V, and so I = K for some K from the statement of Lemma 3.7. It cannot be that k < k' because this is disallowed by the proof of Lemma 3.7. Thus the only possible relationship between k and k' is that k = k'. In this case I has length 2^k and a half of I is removed.
Lemma 3.9. Let x be the smallest element in I, D' ≠ ∅ a prefix of J such that x is the largest element in D', or vice versa x be the largest element in I, D' ≠ ∅ a suffix of J such that x is the smallest element in D'. Let R be the collection
R = V \ (D' \ {x}) := { H | H ≠ ∅ ∧ K ∈ V ∧ H = K \ (D' \ {x}) } .
Then the number of intervals in R is at most the number of intervals in
Proof. The argument is similar to that of Lemma 3.5: we take an interval K from V and argue what part of the interval is in R. We start with an observation that when D' contains just one element x then the result is trivial because R = V. Assume, therefore, that D' contains at least two elements. But then J contains an element that is not in I and so, by property (ii), I ⊂ J. Consequently, we have just two cases to consider: k = k' and k < k'. We will study them in turn. For the first case suppose that k = k'. Take any K from V. The collection V contains some complete intervals from U or halves of some other intervals from U. Hence, by property (i), either K ∩ I = ∅, or K = I, or K is a half of I. Note that the length of K can be either 2^k or 2^{k−1} and the length of I is 2^{k−1}, so K cannot be a half of I. As a result, the last subcase does not happen and we have either
can be either K or empty. Again regularity and monotonicity hold for R.
Finally, consider the second case when k < k'. This case is very similar to the case when k > k' in the proof of Lemma 3.7. We shall show it here for completeness. Then V contains only some complete intervals from U. Take any K from V. By property (ii), we have one of the three mutually exclusive subcases:
is either empty or equal to K. Consequently, the collection R contains only some complete intervals from V , and so desired regularity and monotonicity hold.
Corollary 3.10. If U = V and R ≠ V, then the total number of elements in the intervals of R is at most the total number of elements in the intervals of U minus half of the length of a longest interval in U, i.e., |∪ I∈R I| ≤ |∪
Proof. Since U = V, each interval K from the statement of Lemma 3.9 has length at least half of the length of a longest interval in U. But R ≠ V and so at least one interval or its part must have been removed. Lemma 3.9 shows that either a complete interval is removed or no part of it at all. Thus at least one K is missing in R compared to V.
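The transformation in Lemma 3.9 removes from every interval of V the cells of a prefix (or suffix) of J, except the collision cell x, and keeps only the nonempty remainders. A minimal Python sketch of this set difference follows. Intervals are modeled as inclusive (lo, hi) pairs, and the function names subtract_cells and remove_except_x are hypothetical. The sketch raises an error if a remainder is not contiguous, which the lemma rules out for the collections it considers.

```python
def subtract_cells(collection, removed):
    """Return the nonempty remainders K minus `removed` for every interval K."""
    result = []
    for lo, hi in collection:
        left = [x for x in range(lo, hi + 1) if x not in removed]
        if not left:
            continue  # the whole interval was removed
        if left[-1] - left[0] + 1 != len(left):
            raise ValueError("remainder is not an interval")
        result.append((left[0], left[-1]))
    return result


def remove_except_x(v, prefix_cells, x):
    """Remove the cells of the prefix (or suffix) from V, but keep the cell x."""
    return subtract_cells(v, set(prefix_cells) - {x})


# Hypothetical scenario: I = (4, 7), x = 4, and the prefix {0, ..., 4} of J = (0, 7).
v = [(0, 3), (4, 7)]
print(remove_except_x(v, range(0, 5), 4))  # (0, 3) disappears completely, (4, 7) is untouched
print(remove_except_x(v, [4], 4))          # a one-element prefix leaves V unchanged
```

In both calls every interval of V is either removed completely or left intact, which is the behavior that the proof of Corollary 3.10 relies on.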
Analysis of the algorithm
This section presents an analysis of the collision algorithm given in Figure 2 . Line numbers refer to the code in the figure. Recall that n and p are powers of 2. Without loss of generality, we assume that Read-Modify-Write can transfer O(p) cells between local and shared memory (this assumption can be easily relaxed to comply with our model by first transferring the O(p) cells to a dedicated region of shared memory and then making RMW store a pointer to this region in a cell of the trace table, see proof of Theorem 3.16 for details).
The basic idea of the proof is to "convert" any asynchronous execution of the algorithm into collections of intervals, and then reason about the collections. Any processor iterates through the big while loop (lines 03 to 29) possibly many times. A brief inspection of the code of the algorithm reveals that at the beginning of each iteration, the collection U that the processor maintains in its local memory contains intervals of cells of the Write-All array and that one of the intervals has an unmarked tip. Intuitively, the cells contained in the intervals are the only ones that may still need to be written to, because all other cells have already been written to. An external observer can record (or "remember") the value of the collection U of each processor as the processors iterate, and then reason about the properties of all recorded collections, to conclude that a certain bound on work must hold.
Formally, each time a processor i executes line 03, we record the value of the local variable U of the processor (the collection U does not change in line 03, but we can still record U). This gives rise to a sequence U^0_i, U^1_i, U^2_i, . . . of collections of intervals, where U^k_i is the value of collection U of processor i recorded in line 03 of iteration number k + 1 of the while loop, or U^k_i = ∅ when processor i does not reach iteration k + 1 in the given execution. A convenient way to think about the superscript k is that U^k_i is the collection U of processor i recorded right after iteration k has been completed by the processor; this is why we start the sequence of superscripts from zero. For example, U^0_i is the value of the collection U recorded at the beginning of the first iteration (right after "iteration zero" has been completed). By inspecting the code, we see that U^0_i is always equal to {[0, n − 1]}, for any i, because when processor i executes line 03 for the first time, the only instructions that it has executed earlier are those in lines 01 and 02. Processor i may or may not halt in a given execution. If it does not halt, then the value U^k_i is well defined by the first segment of the definition (processor i will execute line 03 for iteration number k + 1, for all k ≥ 0). However, if the processor halts in iteration number k + 1 < ∞, for some k ≥ 0, then U^k_i is the last collection recorded for processor i, and line 03 will not be reached in iterations k + 2, k + 3, . . .. Then the second segment of the definition puts U^r_i = ∅, for all r ≥ k + 1. So the sequences of collections are well defined for any execution and any processor.
We introduce additional terminology and notation used in the analysis of the algorithm. We say that a processor is working on an interval when it is executing its first RMW in the for loop (lines 06 to 10), or any instruction during this loop until the processor has executed the last RMW in the for loop. At the moment when this last RMW has been executed, we say that the processor has completed working on an interval. Note that this last RMW can be either executed successfully or not. Also, during each iteration of the while loop, the processor may be working on a different interval than in other iterations. For a fixed execution, processor, and iteration of the while loop, we let U z denote the value of U right before the processor executes line number z of the iteration, and U z the value of U right after line number z. It will be clear from the context which execution, processor, and iteration U z or U z refer to.
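The notion of working on an interval can be illustrated with a small single-threaded Python sketch. It is not the algorithm of Figure 2, and it makes simplifying assumptions: the Read-Modify-Write is modeled as "store the payload only if the cell is still empty, otherwise return what is stored there", the payload is a fixed value rather than the data a processor would really store, and the names rmw and work_on_interval are hypothetical. On a real machine the RMW would have to be atomic.

```python
def rmw(table, x, payload):
    """Simplified Read-Modify-Write on cell x of the trace table.

    Succeeds and stores `payload` if the cell is still empty (None);
    otherwise fails and returns the payload stored by the earlier writer.
    """
    if table[x] is None:
        table[x] = payload
        return True, None
    return False, table[x]


def work_on_interval(table, interval, direction, payload):
    """Work from one tip of the interval towards the other tip.

    Returns (done, collision_cell, seen): the cells written successfully,
    the cell of the failed last RMW (None if all succeeded), and the payload
    read from that cell.
    """
    lo, hi = interval
    cells = range(lo, hi + 1) if direction == "R" else range(hi, lo - 1, -1)
    done = []
    for x in cells:
        ok, seen = rmw(table, x, payload)
        if not ok:
            return done, x, seen  # the last RMW failed: a collision
        done.append(x)
    return done, None, None  # the last RMW succeeded


table = [None] * 8
print(work_on_interval(table, (0, 3), "R", "j"))  # j finishes its interval
print(work_on_interval(table, (0, 7), "L", "i"))  # i collides with j at cell 3
```

In both outcomes the processor has completed working on the interval at the moment of its last RMW, whether that RMW succeeded or failed.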
The analysis starts with three lemmas and a corollary that reduce the analysis of the algorithm to the analysis of properties of collections of intervals. The first lemma shows that the collections of intervals, recorded over time as the algorithm unfolds, have a specific structure.
Lemma 3.11. Consider any moment (of the global clock) during an execution of the algorithm, and let
Proof. The proof is by induction on the moments in the execution when the values of the k_i's change (see Figure 7). We shall see that the lemma holds from the moment when processors begin execution until, but excluding, the first time any processor has completed working on an interval. Then we will consider two moments: a moment immediately before any moment when a processor i has completed working on an interval, and the moment when the processor has completed working on the interval. Thus the value of k_i increases by one from the first moment to the second, and no processor executes any action in between these moments. We will argue that if the hypothesis holds at the first moment, it holds at the second moment as well, and also later, until immediately before the next time when a processor (possibly different than i) has completed working on an interval.

Figure 7: The value k_i for processor i increases at the moment when the last RMW instruction is executed in any iteration of the while loop that this processor executes. The induction of Lemma 3.11 focuses on these moments.
Let us consider the base case. The lemma is true right before, for the first time, a processor has completed working on an interval. Indeed, the first line of the collision algorithm for any processor i sets U to {[0, n − 1]}, and the value of U is not changed at least until the processor reaches line 11 of the first iteration of the while loop. Thus
At the moment immediately before a last RMW is executed for the first time by any processor, each processor i is either working, or will work on an interval from U^0_i. Note that collections U^0_1, . . . , U^0_p are regular, as each contains the same single interval of length that is a power of two. Trivially, the collection U^0_i is monotonic for any 1 ≤ i ≤ p, because any single collection is always monotonic.

For the inductive step, pick a moment immediately before any moment when a processor has completed working on an interval, and assume that the lemma is true then. Let this be processor i, and let k_h ≥ 0 denote the number of times processor h has completed working on an interval. Processor i is executing iteration number k_i + 1 of the while loop, and is about to complete working on an interval for the (k_i + 1)-th time. When processor i has finally completed working on the interval, we have two cases: either the processor has succeeded on the last RMW or has failed.

Case 1: If processor i succeeds on the last RMW, then the processor has successfully RMW to all cells in the interval I that it has been working on. Notice that the processor never reads any memory cell that could be written to by a different processor between now and the moment when the processor reaches line 03 again in the next iteration number k_i + 2 of the while loop, if the processor ever reaches this line again. Therefore, the value of U^{k_i+1}_i is already determined. Since the processor has been working on an interval I from U = U are monotonic, and so the collection U 24 is equal to U 19 .
Second, assume that processor j was working on J in the same direction as processor i has been on I. We now study two subcases depending on the success of i in RMW to any cell in I.
Subcase number one is when processor i has managed to successfully RMW to at least one cell in I. Then D ≠ ∅ and so the processor will execute line 21. We now argue that specific relationships must hold between D, D', and x. If i and j worked to the right, then x cannot be the second or later to the right element of J, because then the first element of J would be successfully RMW by j, and so i would have failed on a RMW to an element different than x, as ensured by the fact that i has been working on consecutive elements from the interval I and that D ≠ ∅. So x must be the first element of J, and so x is the smallest element of D' and x − 1 is the largest element of D. Similarly, if i and j were working to the left, then x is the largest element of D' and x + 1 is the smallest element of D. These relationships ensure that when i has executed line 21, by Lemma 3.

Subcase number two is when processor i failed on its first RMW to a cell in I. Then D = ∅ and so processor i will execute line 22. Since i has failed on its first RMW, then, when i works to the right, x must be the smallest element in I and x the largest element in D', or, when i works to the left, x is the largest element in I and x the smallest element in D'. As a result, after i has executed line 22, by Lemma 3.9, the collections U 22 i , U 24 are monotonic.

We can carry out the same analysis as in the case of a successful last RMW described in Case 1 earlier, to show that any processor either has halted, will halt, is working, or will begin work on an interval, at any moment before the next time some k_g is increased, and that the collections are regular and monotonic as desired.
This completes the inductive step and the proof.
The technique used in the proof of Lemma 3.11 could be called the technique of eventual invariant. We formulate an invariant, take any execution that is a sequence of events, show that there is an event when the invariant holds, and that, due to the properties of the asynchronous model, eventually there is another event when the invariant holds again, or the algorithm terminates.

Proof. Using a straightforward inductive argument similar to that given in Lemma 3.11, we can demonstrate that any cell x ∈ {0, . . . , n − 1} \ U^{k_i}_i has been set to 1. In particular, given any processor and its collection U at any moment of the execution, any cell in {0, . . . , n − 1} \ U is known to have been set to 1, along with tips marked by the processor. For the second part, observe that a processor halts only when its U is empty.
The next lemma shows that any processor has at most p/2 intervals in its collection U at any time during execution. The key algorithmic tool that yields this bound is the technique of marking tips of intervals. Recall that a processor splits an interval only when all tips of all intervals in U are marked. So in order to produce many intervals, there must be a moment when the processor has many marked tips. This, however, means that the processor must have collided with many other processors. We can infer that during these collisions, knowledge about progress must necessarily be transferred from the other processors to the processor, and so the processor must necessarily remove a tip, thus preventing the number of intervals from growing too much.

Proof. The proof is by induction on the moments in the execution when the values of the k_i's change, as in Lemma 3.11.
For the base case, we observe that the lemma is true right before, for the first time, a processor has completed working on an interval, because U 0 1 = . . . = U 0 p are collections, each containing just one interval [0, n − 1] with no marked tip.
For the inductive step, pick a moment right before any moment when a processor has completed working on an interval. Let this be processor i, and let k_h ≥ 0 be the number of times processor h has completed working on an interval then. Assume that collection U^k_h has at most p/2 intervals, for any 1 ≤ h ≤ p and 0 ≤ k ≤ k_h. Processor i is executing iteration number k_i + 1 of the while loop, and is about to complete working on an interval for the (k_i + 1)-th time.
Let us bound the number of intervals in U when processor i reaches line 24 of this iteration. During the time when i is working on the interval, collection U of i is equal to U^{k_i}_i. If the last RMW in this iteration is successful, then when the processor reaches line 24, the number of intervals in U is one fewer than in U^{k_i}_i, and so it is at most p/2 − 1. If the RMW failed, then the processor must have collided with a different processor j, and so the value of the variable U 14 of processor i is equal to U^m_j, for some 0 ≤ m ≤ k_j. By the inductive hypothesis the number of intervals in U 14 is at most p/2. After processor i has executed lines 15 to 23, the number of intervals in U can be at most the maximum of the number of intervals in U^{k_i}_i and U^m_j, because of Lemma 3.3, Lemma 3.5, Lemma 3.7, and Lemma 3.9. And so U 24 has at most p/2 intervals.
We now study the evolution of U until processor i reaches line 03 again, if ever. We begin the evaluation with three simple cases. First, if U 24 is empty, then the result is trivial because U k i +1 i is empty. Second, if U 24 has an interval with an unmarked tip, then processor i does not execute lines 27 to 29, and so U k i +1 i is equal to U 24 . Third, if U 24 is not empty, all tips of all intervals in U 24 are marked, and U 24 has strictly fewer than p/2 intervals, then the processor i executes lines 27 to 29. But then splitting an interval increases the number of intervals by exactly one. Consequently, in the third case, U
has one more interval compared to U 24 , and so the number of intervals is bounded by p/2. In all these three cases the inductive step follows.
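The counting step in the third case, where splitting an interval adds exactly one interval to the collection, can be checked with a short Python sketch. It assumes that a longest interval is the one being split, as in the counting argument at the end of the proof of Lemma 3.14. Intervals are inclusive (lo, hi) pairs of even length, and the name split_longest is hypothetical.

```python
def split_longest(collection):
    """Split a longest interval of the collection into its two halves."""
    if not collection:
        return []
    lo, hi = max(collection, key=lambda iv: iv[1] - iv[0])
    length = hi - lo + 1
    if length < 2:
        return sorted(collection)  # a single cell cannot be split
    mid = lo + length // 2
    rest = [iv for iv in collection if iv != (lo, hi)]
    return sorted(rest + [(lo, mid - 1), (mid, hi)])


before = [(0, 3), (8, 15)]
after = split_longest(before)
print(after, len(after) - len(before))
```

Because one interval is replaced by its two halves, a collection with strictly fewer than p/2 intervals has at most p/2 intervals after the split.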
The final and most interesting case of the inductive step is when the collection U 24 has exactly p/2 intervals and all their tips are marked. We show that this cannot happen by way of contradiction.
Let the collection U 24 be composed of the intervals I 1 , . . . , I p/2 . Since all intervals of length 1 with marked tips were removed in line 24, each of the p/2 intervals has length at least 2. Thus the intervals have exactly p distinct tips and the tips are marked.
How can the tips get marked? By inspecting the code we see that a tip x can only get marked after i collides with a processor that worked in the same direction when performing RMW to x.
Then either D = ∅ or D ≠ ∅. When D = ∅, i fails on RMW to tip s in an iteration, and then s remains as a tip and gets marked in this very iteration. When D ≠ ∅, i fails on RMW to x other than s in an iteration, and then x becomes a tip and gets marked in this very iteration. Consequently, in order for any tip x to be marked, processor i must fail on RMW to x. So processor i must have failed on RMW to the p distinct tips by the moment it reaches line 24 of iteration k_i + 1.
Could any of these tips be successfully RMW by i? Note that if processor i had done a successful RMW to a tip of its interval during the for loop in a k-th iteration of the while loop, k ≤ k i + 1, then this tip (not necessarily entire interval) would have been removed from its collection before the processor i reaches line 24 of the while loop in iteration number k, and, by the monotonicity property (iv), no subsequent collection U of processor i can contain the tip. Therefore, when processor i reaches line 24 of the current iteration number k i + 1, all p distinct tips of the intervals I 1 , . . . , I p/2 have been successfully RMW by processors other than i.
By the pigeonhole principle, therefore, there is a processor j other than i, such that the processor j had successfully RMW to two of the p tips on which i failed. Let x and y be these two tips, such that x was RMW before y was, according to the total order established by the RMW instructions. Let j x ≤ j y be the iteration numbers of the while loop of processor j during which the processor did successful RMWs to the two tips x and y respectively. We consider two cases depending on whether x and y were RMW in the same iteration or different iterations of the while loop.
For the first case, suppose that x was RMW in an earlier iteration, i.e., j_x < j_y. Then processor j must have removed the cell x from its collection U by the end of iteration j_x, and so in all subsequent iterations, the collection U of processor j does not contain any interval that contains x. So when j performs a successful RMW to cell y, it stores a collection there, so that none of the intervals of the collection contains x. Subsequently, processor i fails on RMW to cell y, and then the processor retrieves the collection stored in the cell. At that time, no interval of the collection U 14 of processor i contains element x. Hence, after intersection and by monotonicity, the collection U 24 of iteration number k_i + 1 of processor i does not contain any interval that contains x. This is the desired contradiction because we assumed that x is a marked tip of an interval in the collection U 24 .
For the second and final case, assume that x and y were RMW during the same iteration of processor j's while loop, i.e., j_x = j_y. Hence when j does RMW to cell y, its interval D contains x. Subsequently, processor i fails on RMW to cell y and retrieves the interval from the cell. So in the iteration when the retrieval occurs, the D' 14 of processor i contains two distinct elements x and y. Suppose that during this iteration i was working in the same direction as the direction in which j was working when j did a successful RMW to y. Assume for a moment that the direction was R (right). Moreover, assume that processor i succeeded in one or more RMW in this iteration. Since y is the only cell on which processor i can fail in this iteration, i must have succeeded on the RMW to cell y − 1. But this cannot be the case, because x is to the left of y and all cells between x and y inclusive had been successfully RMW by j. A similar contradiction is reached when the direction of work was L (left). Hence, in the iteration when retrieval occurred, it cannot be the case that processor i worked in the same direction as processor j, dir 14 = dir' 14 , and processor i succeeded in a RMW, D 14 ≠ ∅. As a result, i cannot execute line 21 in this iteration and must execute either line 19 or line 22. Consequently, at least one of x or y is removed from U. Again, this leads to a contradiction.
The inductive step is completed, which proves the lemma.
During an execution, a processor performs some number of iterations of the while loop. The next lemma shows an upper bound on this number. The lemma is proven by noticing that the sequence U^0_i, U^1_i, U^2_i, . . . of collections of intervals cannot have too long subsequences of equal collections, because the number of marked tips would increase, and that when two consecutive collections are different, then processor i makes substantial progress on its work. This means that the successive collections quickly become "slimmer and slimmer", and eventually become empty.

Lemma 3.14. In any execution of the algorithm any processor performs at most p^2 + p(2p + 1) log n iterations of the while loop.
Proof. Let us consider any execution, any processor i, and the sequence of all collections U 0 i , U 1 i , U 2 i , . . . that have been recorded for this processor in this execution. Since each collection was recorded, the collection is non-empty. This sequence may perhaps be infinite. We will argue that the sequence of recorded collections cannot have too long subsequences of consecutive equal collections, and that when two subsequent collections are different, then there is substantial difference in the total number of elements that belong to the intervals of the consecutive collections. Hence the sequence of collections "slims fast". As a result, we will be able to conclude that the sequence of recorded collections is finite, and in fact quite short. Thus the number of iterations will be small, too, because the number of iterations is exactly equal to the number of recorded collections.
We begin with an observation that if two consecutive recorded collections are equal, then there is one tip that is unmarked in the previous collection and the tip is marked in the subsequent collection, while all tips already marked in the previous collection remain marked in the subsequent collection. Indeed, suppose that U^k_i = U^{k+1}_i. Note that when a processor is working on an interval, then the tip s from which the processor has started the work is unmarked. Recall that U^k_i is recorded at the beginning of iteration k + 1, and U^{k+1}_i at the beginning of iteration k + 2. Since the two collections are equal, and transformations applied to U in the iteration are monotonic, the value of the collection U must stay intact during iteration k + 1. In particular, U does not change from the moment right before line 11 until right after line 29. In order for U to remain the same, the processor must have failed on its first RMW to the tip s of the interval on which i was working, and the failure must have been because a different processor j had successfully RMW to s while j had been working on an interval in the same direction as i was when i attempted RMW to s (otherwise the value of U would change). As a result, the tip s that was previously unmarked becomes marked in line 23. Note that no tip x of any interval in any collection is ever unmarked by a processor, unless a part of the interval that contains x is removed from the collection. Hence the number of marked tips in U^{k+1}_i is one plus the number of marked tips in U^k_i.

This leads to an observation that the length of a sequence of consecutive equal collections must be bounded. Suppose that for some k ≥ 0 and c ≥ 0, collections U^k_i, U^{k+1}_i, . . . , U^{k+c}_i have been recorded, and they are equal, U^k_i = U^{k+1}_i = . . . = U^{k+c}_i. By Lemma 3.13, the collection U^k_i has at most p/2 intervals and so at most p unmarked tips. Because each subsequent collection in the sequence has one fewer unmarked tip and, by Lemma 3.13, no collection can have p marked tips, c can be at most p − 1. So in the sequence of recorded collections, there can be at most p consecutive equal collections.
We inspect what must happen when two consecutive collections are different. Take any
have been recorded, and c is the largest number so that . We consider all the ways in which the execution of the processor i may proceed from line 11 until line 29 of iteration k + c + 1, and study the changes to U .
The plan for the subsequent analysis is to consider the first moment when U changes in lines 11 to 29. We know that a change must occur somewhere inside these lines. Then, by the properties of the transformations of intervals, we can conclude that a "big" modification must occur, and that this modification must carry over to the end of the iteration, by monotonicity of transformations. Then we can count how many times these modifications can happen in the sequence of recorded collections.
In the paragraphs that follow we will be assuming that U does not change throughout larger and larger initial sections of the code of lines 11 to 29.
If processor i succeeded on the last RMW, then it executes line 12, where an interval is removed from U. By property (i), this interval has length equal to at least half of the length of a longest interval in U. Let j ≠ i be the processor that performed a successful RMW to the cell x while working on an interval. Now we have two cases: processors i and j worked either in opposite directions or in the same direction.
For the first case, assume that the colliding processors i and j worked in different directions. Then, by Corollary 3.6, the total number of elements in the intervals of $U_{19}$ is at most the total number of elements in the intervals of U can only have even fewer elements. For the second case, assume that i and j worked in the same direction. Then processor i executes either line 21 or line 22. In the former case, the value of U always changes; in the latter it may or may not change. Suppose that U changes, i.e., $U_{23} \neq U_{21}$. Then, by Corollary 3.8 and Corollary 3.10, the total number of elements in the intervals of $U_{24}$ is at most the total number of elements in the intervals of U . We now count how many times each of these three modifications can happen until U becomes empty (in which case processor i must halt). Reduction of length by a factor of 2 or more can happen at most log n times, a longest interval can be split into two halves at most p log n times (because the number of intervals in a collection is bounded by p/2), and at least a half of a longest interval can be removed at most p(1 + log n) times. As a result, $U_i^{\,p(\log n + p\log n + p(1+\log n))} = \emptyset$, and the lemma has been proven.
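As a check on the arithmetic, the index of the final (empty) collection in the lemma can be expanded term by term. The worked equation below is an added illustration, not part of the original proof; it shows that the index coincides with the iteration bound $p^2 + p(2p+1)\log n$ that is attributed to Lemma 3.14 later in the text.

```latex
\begin{align*}
p\bigl(\log n + p\log n + p(1+\log n)\bigr)
  &= p\log n + p^{2}\log n + p^{2} + p^{2}\log n \\
  &= p^{2} + \bigl(2p^{2} + p\bigr)\log n \\
  &= p^{2} + p(2p+1)\log n .
\end{align*}
```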
We are now ready to prove the main result of this paper. Proof. Consider any execution of the algorithm. By Lemma 3.14, each processor performs a bounded number of iterations, and, by Corollary 3.12, when a processor stops, all cells of w have been set to 1. Hence the algorithm solves the CWA problem.
We now argue about the work complexity of the algorithm. Let us fix a processor and divide each of its iterations of the while loop into two parts. The first part contains the instructions starting from the first RMW of the for loop until but not including the last RMW of the for loop, while the second part contains all other instructions in the iteration (the second part has two discontinuous sections of instructions). Note that all RMW operations in the first part are successful, so at most n cells of the array w are set to 1 by the p processors during the first parts, and the combined work of all processors on the first parts is O(n) (using a pointer representation of the collections). For any processor, the number of second parts is equal to the number of iterations, which is bounded, by Lemma 3.14, by $p^2 + p(2p+1)\log n$. So the total number of iterations performed by all processors is bounded by $p\,\bigl(p^2 + p(2p+1)\log n\bigr)$. During each second part at most one cell of w can be set to 1, and so the p processors combined may set to 1 at most $p^3 + p^2(2p+1)\log n$ cells of w in their second parts. Recall that, by Lemma 3.13, the number of intervals in any collection during the execution is at most p/2, and so any second part takes O(p) instructions to execute. Thus the combined work that processors perform on the second parts is $O(p^4 \log n)$. This completes the proof.
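The counting in the proof can be evaluated numerically. The following Python sketch is only an illustration: the function names are hypothetical, the logarithm is assumed to be base 2, and the hidden constant in the O(p) cost of a second part is set to 1. It computes the per-processor iteration bound, the resulting bound on the combined work of the second parts, and the largest p for which $p \le (n/\log n)^{1/4}$.

```python
import math


def iterations_per_processor(n: int, p: int) -> float:
    """Iteration bound for one processor: p^2 + p(2p+1) log n (log base 2 assumed)."""
    return p**2 + p * (2 * p + 1) * math.log2(n)


def second_part_work(n: int, p: int) -> float:
    """Bound on the combined work of all second parts.

    p processors, each with at most iterations_per_processor(n, p) second parts,
    each second part costing O(p) instructions (constant assumed to be 1 here).
    """
    return p * iterations_per_processor(n, p) * p


def max_optimal_processors(n: int) -> int:
    """Largest integer p with p <= (n / log n)^(1/4)."""
    return int((n / math.log2(n)) ** 0.25)


if __name__ == "__main__":
    n = 2**20
    p = max_optimal_processors(n)
    print("n =", n, "p =", p)
    print("iterations per processor <=", iterations_per_processor(n, p))
    print("second-part work <=", second_part_work(n, p))
    # For p in this range the second-part bound stays within a constant factor of n.
    print("ratio to n:", second_part_work(n, p) / n)
```

Since $p^4 + 2p^4\log n + p^3\log n \le 4p^4\log n$ whenever $\log n \ge 1$, the second-part bound is at most 4n for every p in the range returned by `max_optimal_processors`, which matches the O(n) total claimed for that range.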
Recall that the code of the algorithm assumes that Read-Modify-Write can transfer O(p) cells between local and shared memory. We now discuss how to relax this assumption, to comply with our model, which allows only O(1) cells to be transferred, by using a pointer representation of the collection U. Proof. Each processor i maintains a shared memory array where collections $U_i^{0}, U_i^{1}, U_i^{2}, \ldots$ will be placed (see Figure 8). The array has p/2 rows and $p^2 + p(2p+1)\log n$ columns, and is stored in a dedicated section of shared memory. The location of the section inside shared memory can be selected using the unique identifier of the processor. Each entry of the array stores two numbers representing the tips of an interval. Recall that in our model each memory cell can accommodate O(log n) bits. Hence the total number of shared memory cells sufficient for the p arrays is $O(p^4 \log n)$.
When processor i begins iteration k, it transfers the collection U to column k of the array. In the algorithm as stated in Figure 2, each RMW transfers the current collection U to a cell of the trace table tab. This violates the model, because we may need to transfer O(p) cells, while in our model we assume that RMW can access only a constant number of cells. In the actual version of the collision algorithm, the processor will transfer a pointer to column k of the array during any RMW operation of iteration k. Thus, assuming that a pointer takes O(log n) bits, RMW only needs to transfer a constant number of cells. In the algorithm as stated in Figure 2, the processor reads U directly from the trace table. In the actual algorithm, the processor will read a pointer to a column of an array, and retrieve the collection of intervals from that column.
The number of rows allocated for each processor is sufficient, by Lemma 3.13, to accommodate any U produced during any execution of the algorithm. The number of columns is sufficient, by Lemma 3.14, to complete one execution of the collision algorithm.
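The pointer representation described above can be sketched as follows. This is a minimal sequential illustration under stated assumptions, not the paper's implementation: the class `SharedArrays`, its methods, and the dictionary standing in for the trace table tab are all hypothetical names; the row count p/2 is rounded up; the logarithm is taken base 2; and atomicity of RMW is not modeled. The point shown is that a trace-table cell only ever holds a constant-size pointer (processor identifier, column index), while the collection itself stays in the processor's dedicated array.

```python
import math
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # the two tips of an interval of {1, ..., n}
Pointer = Tuple[int, int]   # (processor id, column index): O(log n) bits


class SharedArrays:
    """Hypothetical model of the p per-processor arrays kept in shared memory."""

    def __init__(self, n: int, p: int) -> None:
        # Rows: at most p/2 intervals per collection (rounded up, an assumption).
        self.rows = math.ceil(p / 2)
        # Columns: the iteration bound p^2 + p(2p+1) log n, rounded up.
        self.cols = p * p + math.ceil(p * (2 * p + 1) * math.log2(n))
        self.columns: Dict[Pointer, List[Interval]] = {}

    def store(self, proc: int, k: int, collection: List[Interval]) -> Pointer:
        """Place the collection of processor `proc` in column k; return a pointer to it."""
        if len(collection) > self.rows or not 0 <= k < self.cols:
            raise ValueError("collection or iteration exceeds the allocated array")
        self.columns[(proc, k)] = list(collection)
        return (proc, k)

    def load(self, ptr: Pointer) -> List[Interval]:
        """Follow a pointer read from the trace table and retrieve the collection."""
        return list(self.columns[ptr])


if __name__ == "__main__":
    arrays = SharedArrays(n=1024, p=4)
    tab: Dict[int, Pointer] = {}  # stand-in for the trace table
    # Processor 2, at the start of iteration 0, publishes its collection once ...
    ptr = arrays.store(2, 0, [(1, 512), (513, 1024)])
    # ... and each RMW of that iteration transfers only the pointer.
    tab[7] = ptr
    print(arrays.load(tab[7]))
```

Keeping one column per iteration means a pointer written earlier in the execution still refers to the collection as it was in that iteration, at the cost of the $O(p^4 \log n)$ space computed above.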
Future work
It should be possible to tighten the analysis of work complexity of the algorithm by further exploring the information flow between processors. We believe that the actual work complexity of our algorithm is no more than $O(n + p^3 \log n)$, because the analysis given in the proof of Lemma 3.14 appears to have a fair amount of slack.
We believe that it should be possible to further reduce the work and space complexities of the algorithm by modifying data structures. Specifically, instead of creating a new U during each iteration, we could try to reuse the parts of U that have not changed since the prior iteration. This should decrease space complexity, but may increase work complexity, as the new representation of U may be more "dispersed". In the proof of Theorem 3.15, we bounded by O(p) the time to perform the transformations of U. Is Θ(p) necessary? It would be interesting to find a representation of U that would allow performing r transformations in time o(rp) and space o(rp), possibly by exploiting regularity and monotonicity.
In our algorithm there may be more than one unmarked tip to select from at the beginning of an iteration of the while loop. Currently, the algorithm arbitrarily selects an unmarked tip of an interval. However, one can consider a more refined process of selection when there are choices. One could apply the techniques of "oblivious schedules" of Anderson and Woll [1], in conjunction with identifiers of processors, to select a tip to begin work from, so that overall not too many processors also select this tip. The combination of the collision technique and the oblivious technique seems a promising approach for constructing an algorithm that would have asymptotically optimal work for a wider range of the number of processors.
The algorithm uses a special RMW primitive to efficiently detect collisions. Is the use of such a primitive essential for obtaining an algorithm with asymptotically optimal work for a wide range of the number of processors? Would the use of a more standard single-word RMW, compare-and-swap, or even atomic reads and writes be sufficient? A recent paper of Kowalski and Shvartsman [26] shows that atomic reads and writes are sufficient.
The work of any deterministic asynchronous algorithm for the CWA problem with $p \le n^{1-\varepsilon}$ processors must obviously be Ω(n). We have shown that there is an O(n) algorithm when $1 \ge \varepsilon \ge 4/5$. Is it true that there exists an O(n) algorithm for ε arbitrarily close to 0? For what values of p compared to n is there a non-trivial lower bound on the work of a deterministic algorithm? In particular, is ω(n) work necessary when p is o(n/ log n)?
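To see why the range with exponent gap at least 4/5 is covered by the work bound O(n + p^4 log n), write ε for the gap in the bound p ≤ n^{1−ε}. The following short calculation is an added illustration and uses only the fact that log n = O(n^{1/5}).

```latex
\begin{align*}
\varepsilon \ge \tfrac{4}{5}
  \;\Longrightarrow\;
  p \le n^{1-\varepsilon} \le n^{1/5}
  \;\Longrightarrow\;
  p^{4}\log n \;\le\; n^{4/5}\log n
  \;=\; n\cdot\frac{\log n}{n^{1/5}} \;=\; O(n),
\end{align*}
```

so that $O(n + p^{4}\log n) = O(n)$ for every such p.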
Acknowledgements. The author thanks his host, Charles Leiserson, for an invitation to join the Supercomputing Technologies Group, and for discussions, including those on how to write a better proof. This paper was inspired by earlier work with Alex Shvartsman. The author thanks Bogdan Chlebus, Piotr Indyk, Arnold Rosenberg, Alex Russell, Alex Shvartsman, and the anonymous PODC'03 reviewers for reading drafts of the paper and commenting on them. Their input improved the clarity of the presentation. The author particularly acknowledges the three SIAM Journal on Computing reviewers for the careful reading of the manuscript, the many detailed comments, and the suggestions to better explain how the algorithm works. Finally, the author thanks Mumtaz Lohawala for hosting him in her house for one year during the author's visit to MIT.
