In this paper we examine the general multi-word lock problem, where processes are allowed to multi-lock arbitrary registers. Aiming for a highly efficient solution, we propose a randomized algorithm which successfully breaks long dependency chains, the crucial factor for slowing down an execution. In the analysis we focus on the 2-word lock problem and show that in this special case an execution of our algorithm takes, with high probability, at most time $O(\Delta^3 \log n / \log\log n)$, where n is the number of registers and ∆ the maximal number of processes interested in the same register (the contention). Furthermore, we implemented our algorithm for the general multi-word lock problem on an SGI Origin2000 machine, demonstrating that our algorithm is not only of theoretical interest.
INTRODUCTION
Edsger Dijkstra's dining philosophers problem is widely recognized as a prototypical resource allocation instance. We are given n philosophers, sitting at a round table. Each philosopher is an asynchronous process who cycles through the three states thinking, hungry, and eating. Between each neighboring pair of philosophers, there is a fork. When becoming hungry, a philosopher tries to grab her left and right fork. After having acquired both forks, the philosopher eats. When finished eating, the philosopher returns her forks and goes back to thinking mode.
We can represent the classic dining philosophers problem in a shared memory multi-processor system by having n shared registers (the forks) and n processes (the philosophers). The two registers ("forks") which are of interest to process pi (with i = 1, 2, . . . , n) are registers i and i + 1 (with the notable exception that the "right fork" of processor pn is register 1, and not n + 1, to achieve the desired ring topology). In shared memory dining philosophers, each process repeatedly and asynchronously tries to lock its two registers (hungry), then performs some atomic operation on these two registers, such as multi-word compare-and-swap (eating), and then continues with other operations (thinking).
The dining philosophers problem perfectly illustrates typical multi-process synchronization difficulties. If all philosophers become hungry at the same time, and pick up their left fork simultaneously, we have a deadlock, since no philosopher can grab her right fork as well. Similarly, if a process crashes (or behaves awfully slow) after locking its registers, the two neighbor processes cannot make progress; as a remedy the research community has proposed non-blocking protocols, such as recursive helping schemes, or transactional memory.
In this paper we focus on a third fundamental multiprocess synchronization issue, efficiency. In dining philosophers, "even" philosophers (processes with even process id) do not have a conflict of interest among themselves. An efficient implementation striving for maximum concurrency would therefore always let even and odd processes eat in turns, thus maximizing the available resources.
In this paper we examine the general multi-word lock problem, where processes are allowed to multi-lock arbitrary registers. The remainder of the paper is organized as follows: In Section 2 we set our paper into the context of prior art. The model is then formally introduced in Section 3. In Section 4 we present our algorithm and analyze it; in particular we show that, with high probability, a process has to wait at most $O(\Delta^3 \log n / \log\log n)$ time until it can eat, where n is the number of registers and ∆ the maximal number of processes interested in the same register (the contention). In Section 5 we present extensive results from our implementation on an SGI Origin2000 machine, showing that our idea is not only of theoretical interest. Finally, in Section 6 we conclude the paper.
RELATED WORK
Since processes without a conflict can proceed concurrently it seems promising to first compute a minimum coloring of the conflict graph. Yet solving the multi-lock problem using coloring remained theory.
In a generalized variant of dining philosophers, a process shares d forks and can only eat if it has obtained all d forks. For this generalized problem, [9] gave a solution with waiting chains of length O(c), assuming that an oracle has colored the conflict graph with c colors. The waiting chain length was reduced to O(log c) in [13]. Assuming that a vertex coloring with d+1 colors is known, this length was further reduced to 3 in [4]. For a simplified version of dining philosophers, [11] manages to have waiting chains of length at most 4 in constant time.
Unfortunately, even in a powerful message passing model, coloring is a tough problem. It was proven in [8] that such colorings cannot be found in constant time. In fact, even simpler problems (such as independent sets) have logarithmic lower bounds [8] . More severely, the conflict graph is not available straightforwardly. To compute the conflict graph, processes need some form of synchronization. We believe that this synchronization is as hard to achieve as the original multi-lock problem.
In this paper we present an efficient but blocking algorithm for the general multi-lock problem, consequently without making a detour through coloring. A blocking algorithm for the multi-lock problem can be turned into a lock-free algorithm if it is combined with a helping technique. The basic idea of the so called cooperative technique [3] , a form of helping technique, was improved and is still developing in a series of nifty research papers [7, 12, 1, 10, 6, 5] .
For readability we do not integrate our algorithm with a helping scheme; however, it can be added to our algorithm: Each process, before locking registers, somewhere notes what it wanted to do with that register. In case the process crashes while holding the lock, others can help it finish by following its steps. The implementation of our algorithm used in Section 5 of this paper includes a helping scheme, rendering our implementation lock-free.
In distributed graph algorithms (message passing), randomization techniques are widely used. With a few exceptions [2], the shared memory community generally does not apply randomization, presumably because of its alleged overhead. Our experiments show that the overhead due to randomization is less than 1% of the total execution time.
PROBLEM AND MODEL
In this section we recapitulate the problem we consider and formally define the model used in the next sections.
We study the multi-lock problem, which is a generalization of dining philosophers. In the multi-lock problem each participating process needs to lock multiple registers in order to do some operation on the locked registers, like an N-word compare-and-swap (CASN).
Typically, in a multi-lock implementation a process tries to lock all its registers one by one. To avoid deadlocks, the registers are totally ordered, conventionally by their identifiers (id). When executing a k-lock, a process p locks its registers r1, . . . , r k according to their total order.
As discussed in Section 2, there exist several schemes which can be employed once a process is blocked by other processes from locking its registers. In the analysis we assume that processes simply wait (spin-lock) until the block is resolved, yet for the implementation (Section 5) we include a helping scheme.
We consider m asynchronous processes which can access n shared registers. In the analysis we concentrate on 2-locks. Each process only executes a single 2-lock and then goes to sleep.
The dependencies between the processes are modelled by a directed acyclic conflict graph G = (V, E). In G each node represents a register and each edge represents a process. In the following, we will use the terms node/register and edge/process interchangeably. There is a directed edge p from node r1 to node r2 iff process p tries to lock register r1 first and after being successful tries to lock register r2, meaning that r1 comes before r2 in the total order of registers. Since all directed edges point from nodes with lower id to nodes with higher id the resulting graph G is acyclic.
Following the conventions for asynchronous processes, in the analysis we assume that each atomic operation, like reading, writing, or locking a register, incurs a delay of at most one time unit. An operation on multiple locked registers (e.g. CAS2 ) incurs a delay. For convenience let c be the longest time which elapses from the moment a process has locked its last register until it releases the lock on all its registers.
In the remainder of this section, to illustrate our model, we quickly analyze a classical implementation of dining philosophers. We show that it is a factor Ω(n) less efficient than an optimal implementation.
The classical implementation proceeds as follows: the processes try to lock their registers one by one, each starting with the register with smaller identifier. Consider the following execution: First, each process pi, i < n, locks register ri, whereas pn fails to lock r1 (due to p1). Then, each process tries to lock register ri+1, yet only process pn−1 succeeds. The second register of all other processes is locked by another process. Thus, process pi has to wait until process pi+1 releases its lock on ri+1. By induction, after having waited Θ(cn) time units p1 releases its lock on r1 and pn may lock both its registers. Thus, the execution time is Ω(cn). An optimal implementation needs only O(c) time and hence the classical algorithm is a factor Ω(n) less efficient than an optimal algorithm.
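To make the Ω(n) gap concrete, the following Python sketch builds the conflict graph of the ring and measures its longest directed path, which is the dependency chain that the execution above runs through. All names (`longest_directed_path`, `ring`, `random_rank`) are illustrative assumptions and not part of the paper; registers are numbered from 0 for convenience.

```python
import random

def longest_directed_path(num_registers, processes, rank):
    """Length (in edges) of the longest directed path in the conflict graph.

    processes: list of (register_a, register_b) pairs, one pair per process.
    rank: dict mapping register -> position in the total lock order.
    Each process is an edge from its lower-ranked to its higher-ranked register.
    """
    successors = {r: [] for r in range(num_registers)}
    for a, b in processes:
        first, second = (a, b) if rank[a] < rank[b] else (b, a)
        successors[first].append(second)
    longest = {}
    # Visit registers from highest to lowest rank, so every successor
    # has already been evaluated when a register is processed.
    for r in sorted(range(num_registers), key=lambda x: -rank[x]):
        longest[r] = max((1 + longest[s] for s in successors[r]), default=0)
    return max(longest.values())

def ring(n):
    """Dining philosophers: process i wants registers i and (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]

def random_rank(n, seed=None):
    """A uniformly random total order on the registers (hypothetical helper)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return {reg: pos for pos, reg in enumerate(order)}

n = 64
identity = {r: r for r in range(n)}
print(longest_directed_path(n, ring(n), identity))        # n - 1 = 63 for the id order
print(longest_directed_path(n, ring(n), random_rank(n)))  # varies from run to run
```

With the identifier order the whole ring collapses into one directed path of n − 1 processes; the second call shows how the same measurement behaves under a random order.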
RANDOMIZED REGISTERS
In this section we present a more efficient algorithm for multi-lock and analyze it according to two standard criteria for the special case of 2-lock.
The Algorithm
Alerted by the poor execution time of the classical algorithm for dining philosophers due to its long dependency chain, we aim at breaking dependency chains. A promising yet simple (allowing for an efficient implementation) approach is randomization. Specifically, we suggest to randomly permute the order of the registers. Let Π be a permutation on the registers, chosen uniformly at random. The permutation represents the new total ordering of the registers. For details on how the randomization can be implemented we refer to Section 5.
In short we henceforth write p = (ri, rj) meaning that process p wants to acquire register ri and rj and that Π(id(ri)) < Π(id(rj)). Thus, ri is p's first register and rj is p's second register.
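The locking discipline described above can be sketched in Python as follows. This is a minimal blocking sketch under stated assumptions, not the implementation evaluated in Section 5: the class `RandomizedRegisters`, its methods, and the driver `run_ring` are hypothetical names, Python's `threading.Lock` stands in for a hardware register lock, and no helping scheme is included.

```python
import random
import threading

class RandomizedRegisters:
    """Registers that are always locked in the order of a random permutation."""

    def __init__(self, num_registers, seed=None):
        order = list(range(num_registers))
        random.Random(seed).shuffle(order)           # uniformly random permutation
        # rank[reg] plays the role of Pi(id(reg)) in the text.
        self.rank = {reg: pos for pos, reg in enumerate(order)}
        self.locks = [threading.Lock() for _ in range(num_registers)]
        self.values = [0] * num_registers

    def multi_lock(self, registers):
        """Lock all requested registers one by one, lowest rank first."""
        ordered = sorted(set(registers), key=self.rank.__getitem__)
        for reg in ordered:
            self.locks[reg].acquire()
        return ordered

    def multi_unlock(self, ordered):
        for reg in reversed(ordered):
            self.locks[reg].release()

def philosopher(shared, i, n, rounds):
    """Process i repeatedly 2-locks registers i and (i + 1) mod n."""
    for _ in range(rounds):
        held = shared.multi_lock([i, (i + 1) % n])
        for reg in held:                             # the operation on the locked registers
            shared.values[reg] += 1
        shared.multi_unlock(held)

def run_ring(n=8, rounds=200, seed=1):
    shared = RandomizedRegisters(n, seed)
    threads = [threading.Thread(target=philosopher, args=(shared, i, n, rounds))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.values

print(run_ring())  # every register is updated by its two neighbours: 2 * rounds each
```

Because every process acquires its registers in the same total order, the sketch cannot deadlock; the random permutation only changes which order that is.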
In the next sections we analyze the efficiency of the suggested 2-lock algorithm, which uses a randomized total ordering of registers. As in [13] we evaluate two related properties: the maximum length of a waiting chain and the longest time a process needs until it successfully performs a 2-lock. Towards this goal, we first prove some basic properties of the conflict graph (see footnote 2). Thereafter, we analyze the length of waiting chains and finally show that with high probability the execution is finished after $O(c\Delta^3 \log n / \log\log n)$ time.
Length of Directed Paths
Henceforth, we denote by G the conflict graph as obtained by the random permutation of the registers. Let the maximum degree in G be ∆. In this section we analyze the length of a directed path in G. The following facts are used for the analysis, the proofs of which can be found in standard mathematical textbooks.
Fact 4.2 (Markov). For a nonnegative random variable X and any t > 0, $\Pr[X \ge t] \le E[X]/t$.
Throughout the paper log n denotes the logarithm with base two.
To estimate the length of a directed path in G, we first upper bound the number of distinct undirected paths of length k in G: To obtain an undirected path of length k one can choose one out of n nodes in G as start node. In any node there are at most ∆ neighbor nodes to continue the path. Therefore:
Observation 4.3. There are at most $n \cdot \Delta^k$ distinct undirected paths of length k in G.
Footnote 2: Note that the conflict graph is only needed for analysis purposes. The processes do not know the conflict graph.
As a next step we give the probability that a given path of length k in G is directed.
Observation 4.4. The probability that a given path of length k in G is directed is $2/(k+1)!$.
Proof. In a path of length k there are k + 1 nodes $u_1, \ldots, u_{k+1}$. For the path to be directed, it must hold that either $\Pi(id(u_1)) < \Pi(id(u_2)) < \ldots < \Pi(id(u_{k+1}))$ or $\Pi(id(u_1)) > \Pi(id(u_2)) > \ldots > \Pi(id(u_{k+1}))$. Hence, exactly two out of the (k + 1)! possible relative orderings of these nodes are good.
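The counting argument can be checked exhaustively for small k. The sketch below enumerates every relative ordering of the k + 1 nodes of a fixed path and counts those that make the path directed; the function name `directed_fraction` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def directed_fraction(k):
    """Exact probability that a fixed path with k >= 1 edges is directed
    when the relative order of its k + 1 nodes is uniformly random."""
    good = 0
    total = 0
    for ranks in permutations(range(k + 1)):
        total += 1
        increasing = all(ranks[i] < ranks[i + 1] for i in range(k))
        decreasing = all(ranks[i] > ranks[i + 1] for i in range(k))
        if increasing or decreasing:
            good += 1
    return Fraction(good, total)

for k in range(1, 6):
    # Matches the 2 / (k + 1)! of the proof above.
    assert directed_fraction(k) == Fraction(2, factorial(k + 1))
```

Only the relative order of the nodes on the path matters, which is why enumerating permutations of k + 1 ranks suffices regardless of n.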
Thus, the probability that a path is directed decreases exponentially with increasing path-length. Combining both Observation 4.3 and Observation 4.4 gives an upper bound on the number of directed paths of length k.
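Lemma 4.5. Let C be the number of directed paths of length $k = 3\Delta \log n / \log\log n$ in G. Then $E[C] \le 1/n^{\Delta}$.
Proof. Let $p_i$ denote a path of length k and let $X_{p_i}$ be defined as follows: $X_{p_i} = 1$ if $p_i$ is directed, and $X_{p_i} = 0$ otherwise.
Then by linearity of expectation,
$$E[C] = \sum_i E[X_{p_i}] \le n \cdot \Delta^k \cdot \frac{2}{(k+1)!}$$
by Observation 4.3 and Observation 4.4. Applying Stirling's formula (Fact 4.1) and substituting $3\Delta \log n / \log\log n$ for k yields the following inequalities
$$E[C] < n \cdot \frac{(e\Delta)^k}{(3\Delta \log n / \log\log n)^k} < n \cdot \frac{1}{(\log n / \log\log n)^k} = n \cdot \frac{1}{2^{(3\Delta \log n / \log\log n)(\log\log n - \log\log\log n)}} = n \cdot \frac{1}{n^{3\Delta(1 - \log\log\log n / \log\log n)}},$$
which is at most $1/n^{\Delta}$ for sufficiently large n.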
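Finally, we bound the probability that there exists a directed path in G by applying Markov's inequality.
Corollary 4.6. With probability at most $1/n^{\Delta}$ there exists a directed path of length at least $3\Delta \log n / \log\log n$.
Proof. Every directed path of length at least k contains a directed path of length exactly k, so it suffices to bound the probability that C ≥ 1. By Fact 4.2 and Lemma 4.5, $\Pr[C \ge 1] \le E[C] \le 1/n^{\Delta}$.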
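As a numerical sanity check of this bound, the following sketch evaluates the product of Observation 4.3 and Observation 4.4, $n \cdot \Delta^k \cdot 2/(k+1)!$, in log scale and compares it with $1/n^{\Delta}$ for one example parameter choice. The function names and the example values n = 2^20 and ∆ = 3 are assumptions made for illustration; the check covers only that instance, not the asymptotic statement.

```python
from math import ceil, lgamma, log, log2

def log2_expected_directed_paths(n, delta, k):
    """log2 of the upper bound n * delta**k * 2 / (k + 1)! on E[C]."""
    log2_factorial = lgamma(k + 2) / log(2)      # lgamma(k + 2) = ln((k + 1)!)
    return log2(n) + k * log2(delta) + 1 - log2_factorial

def chain_length_bound(n, delta):
    """The path length 3 * delta * log n / log log n, rounded up (needs n > 4)."""
    return ceil(3 * delta * log2(n) / log2(log2(n)))

n, delta = 2 ** 20, 3
k = chain_length_bound(n, delta)
bound = log2_expected_directed_paths(n, delta, k)
# Markov's inequality then gives Pr[C >= 1] <= E[C] <= 1 / n**delta here.
assert bound <= -delta * log2(n)
print(k, bound)
```

Working in log scale avoids overflow in the factorial; `lgamma` is used only as a convenient way to obtain ln((k + 1)!).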
Length of Waiting Chains
Following the notation of [13] we define a waiting chain as a series of processes such that each process in the chain is waiting for some action by the next process in the chain. Avoiding long waiting chains is important since it implies a long wait for the last process in the chain.
A process p is delayed by another process q if q can, by slowing down or stopping, cause p to have a longer total waiting time than if q stayed in its remainder section. The maximum length of a waiting chain is the maximum distance between two processes such that one process can delay the other. We illustrate the concept of delaying with the example of Figure 1. Process p can be delayed by all processes in the figure: For example, process q2 can delay p if q1 locks r1 before p and q2 locks r6 before q1. Thus, q1 has to wait for q2 until it can acquire (and release) both its locks and consequently p (which waits for q1 to release the lock on r1) also has to wait for q2. Process q6 can also delay p: q6 locks r3, q3 locks r2 and p locks r1. Thus, p is waiting for q3 to release r2, which itself is waiting for q6 to release r3. On the other hand, process p could not be delayed by a (not shown) process q7 = (r4, x), where x is some register not depicted in Figure 1. This is because process q7 may acquire register r4 before q6 and thus q6 has to wait until q7 releases this register again before it can proceed. Yet, this in no way delays p since it does not impose a longer waiting time on p. In general:
Lemma 4.7. Process q = (R1, R2) can delay process p = (r1, r2) if and only if R2 lies on a directed path starting in either r1 or r2.
Proof. By definition, if process q delays process p, the waiting time of p must be longer if q slows down or stops than if q stayed in its remainder section. If process q's second register R2 lies on a directed path starting in $r_i$, $i \in \{1, 2\}$, then there exists a directed path of processes $q_1 = (r_i, r_j), \ldots, q_k = (r_l, R_2)$, with possibly $q_k = q$. In the case that each process in this path has locked its first register (and, if $q_k \neq q$, q has also locked its second register), none of the processes $q_1, \ldots, q_k$ makes any progress as long as q does not make any progress. Thus, process p will not be able to lock its register $r_i$ and hence its waiting time is longer than if q stayed in its remainder section, showing that q delays p.
In order to show the other direction of the lemma, we let Q be the set of processes which do not lie with their second register on a directed path from $r_i$, $i \in \{1, 2\}$. Between any arbitrary process q in Q and any directed path $P_i$ from $r_i$ there is at least one process $\bar{q}$ which breaks this directed path, that is, $\bar{q}$'s second register lies on $P_i$ whereas its first register does not. By slowing down or stopping its execution, q may hinder $\bar{q}$ in acquiring its first register, yet it does not hinder $\bar{q}$ in acquiring its second register; otherwise one of q's registers would also lie on a directed path from $r_i$, a contradiction to the assumption that q is in Q. Thus, by slowing down or stopping, q either does not affect $\bar{q}$, or $\bar{q}$ cannot participate in the execution at all as long as q does not make any progress. Hence, process q cannot delay p.
The maximum length of a waiting chain in the randomized registers algorithm is, with probability at least $1 - 1/n^{\Delta}$, at most $3\Delta \log n / \log\log n + 1$.
Proof. The maximal number of edges between a process p and a process q which delays p is at most the length of the longest directed path plus one, since by Lemma 4.7 a process which delays p must be incident with its second register to a directed path. Hence, we can directly apply Corollary 4.6.
Execution Time
Though the length of a waiting chain is an indicator of the efficiency of an algorithm, it is only a lower bound for the execution time. For the execution time we must bound two values: First, we need to bound the time until a process is able to lock its first register, then we need to bound the time until it can lock its second register. Towards this goal we introduce some helpful definitions.
The execution starts at time zero. A process p = (r1, r2) locks its first register at time t1(p) and its second register at time t2(p). Using the definition of c from Section 3, process p releases both its locks at time t3(p) ≤ t2(p) + c.
The delay graph D(p) of p consists of all processes q = (R1, R2) for which there is a directed path from r1 to R1, and the depth of p, depth(p), is the length of a longest directed path in D(p). In the example of Figure 3, processes p, q1, q2, q3, q4 are in p's delay graph, whereas processes q5 and q6 are not, since there is no directed path from r1 to q6's first register r4 or to q5's first register r8. Intuitively, processes which are incident to a directed path from p merely by their second register do not delay p much, since those processes release their lock on the crucial register quickly after acquiring it. Note that the depth of a process is at least one, since at least the process itself lies in its delay graph. In Figure 3, the depth of p is 3; q1's depth is also 3.
We now bound the maximal depth of any process by directly applying Corollary 4.6:
Corollary 4.10. The maximum depth $k^*$ of any process is at most $3\Delta \log n / \log\log n$ with probability at least $1 - 1/n^{\Delta}$.
The next lemma reveals a key property of the delay graph.
Lemma 4.11. Let q = (R1, R2) be a process in the delay graph D(p) of process p = (r1, r2). Then depth(p) ≥ depth(q). If furthermore r1 ≠ R1, then depth(p) > depth(q).
Proof. Assume without loss of generality that $P_{R_j u} = (R_j, u_1, \ldots, u_k, u)$, $j \in \{1, 2\}$, is a longest directed path in q's delay graph. Then depth(q) $= |P_{R_j u}| = |\{R_j, u_1, \ldots, u_k, u\}| - 1$. Since q ∈ D(p), there is a directed path $P_{r_1 R_1} = (r_1, v_1, \ldots, v_l, R_1)$ from r1 to R1, where $|P_{r_1 R_1}| = |\{r_1, v_1, \ldots, v_l, R_1\}| - 1$ is the length of this path. Consequently, there exists a directed path $P_{r_1 u} = (r_1, \ldots, R_1, R_j, \ldots, u)$ from r1 to u. Thus, depth(p) $\ge |P_{r_1 u}| \ge |P_{R_j u}| =$ depth(q).
If furthermore r1 ≠ R1, then depth(p) $\ge |\{r_1, \ldots, v_l, R_1, R_j, \ldots, u\}| - 1 \ge |\{r_1, \ldots, v_l\}| + |\{R_1, R_j, \ldots, u\}| - 1 \ge 1 + |P_{R_j u}| >$ depth(q).
The following corollary shows that along a directed path the depth of the processes is strictly decreasing.
Corollary 4.12. Let $P = (r_1, r_2, \ldots, r_{k+1})$ be a directed path and let $p_i = (r_i, r_{i+1})$, $1 \le i \le k$, be the processes on this path. Then depth($p_i$) > depth($p_{i+1}$) for $1 \le i \le k - 1$.
Proof. By the definition of a delay graph, a process q = (R1, R2) lies in the delay graph D(p) of process p = (r1, r2) iff there is a directed path between r1 and R1. Thus, process pi+1 lies in the delay graph of process pi since by the assumption ri and ri+1 lie on a directed path. We hence may apply Lemma 4.11 which states that the depth of a process q which lies in the delay graph D(p) of process p is strictly smaller than p's depth if p's first register is not equal to q's first register. Since in our case the first register of pi+1 is ri+1 and the first register of pi is ri this condition holds and thus the depth of pi+1 is strictly smaller than pi's depth.
Corollary 4.13. There is a process with depth one in any conflict graph G.
Proof. Let p1 = (r1, r2) be a process in G with depth k. Then there exists a directed path $P = (r_1, r_2, \ldots, r_{k+1})$ from r1 to some node $r_{k+1}$ of length k (the number of processes in P). By Corollary 4.12 the depth of the processes $p_i = (r_i, r_{i+1})$, $1 \le i \le k$, in this path is strictly decreasing. Thus, depth($p_{i+1}$) ≤ depth($p_i$) − 1 for $1 \le i \le k - 1$, and since depth(p1) = k we have depth($p_k$) ≤ 1. The depth of any process is at least one and consequently depth($p_k$) = 1.
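The depth of every process can be computed directly from the oriented conflict graph. The sketch below assumes, as the proofs above use it, that depth(p) equals the length of a longest directed path starting at p's first register (every process in D(p) is reachable from that register, so a longest path in D(p) can be taken to start there). The function name `process_depths` and the example processes are hypothetical.

```python
from functools import lru_cache

def process_depths(processes):
    """processes: dict name -> (first_register, second_register), where each
    pair is already oriented by the random register order.
    Returns a dict name -> depth."""
    successors = {}
    for first, second in processes.values():
        successors.setdefault(first, []).append(second)
        successors.setdefault(second, [])

    @lru_cache(maxsize=None)
    def longest_from(register):
        # Longest directed path (counted in processes) starting at `register`.
        return max((1 + longest_from(nxt) for nxt in successors[register]), default=0)

    return {name: longest_from(first) for name, (first, _) in processes.items()}

# A directed path r1 -> r2 -> r3 -> r4 plus one extra process leaving r1.
example = {"p1": (1, 2), "p2": (2, 3), "p3": (3, 4), "q": (1, 5)}
depths = process_depths(example)
assert depths == {"p1": 3, "p2": 2, "p3": 1, "q": 3}
```

In the example the depths along the path p1, p2, p3 decrease strictly, as Corollary 4.12 states, and p3 is a process of depth one, as Corollary 4.13 guarantees. The process q shares its first register with p1 and therefore has the same depth.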
In the next lemma we upper bound t3(p) for a process p with depth one.
Lemma 4.14. For a process p = (r1, r2) with depth(p) = 1 we have $t_3(p) \le 4c\Delta^2$.
Proof. A process p has depth one iff the following two conditions hold: (1) A process qj incident to p's first register r1 is either incoming in r1 (type a), that is qj = (x, r1) with x an arbitrary register, or qj = (r1, x) and all processes incident to qj's second register x are incoming in x (type b). (2) A process qi incident to p's second register r2 is incoming in r2, that is qi = (x, r2) with x an arbitrary register. (See also Figure 4.)
We first concentrate on type a processes: A process of type a releases its lock on r1 at most c time units after acquiring it. The next process acquires the lock on r1 at most one time unit later. Thus each process of type a adjacent to r1 delays p for at most c + 1 ≤ 2c time units.
A process qj = (r1, x) of type b must wait at most for all processes incident to its second register x until it can acquire the lock on x and thereafter release r1. A process incident to x releases its lock on x at most c time units after acquiring it and the next process acquires it at most one time unit later. Thus, each process incident to x delays qj for at most c + 1 ≤ 2c time units. Besides qj there are at most ∆ − 1 processes incident to x. We thus immediately get that qj acquires its lock on x after at most 2c(∆ − 1) + 1 time units and releases its locks after at most c more time units. Thus each process of type b adjacent to r1 delays p for at most 2c(∆ − 1) + 1 + c ≤ 2c∆ time units.
Besides p there are at most ∆ − 1 processes incident to r1, each of which releases its lock on r1 at most 2c∆ time units after acquiring it. Therefore, we immediately get $t_1(p) \le 2c\Delta(\Delta - 1)$.
The time until p can lock r2 is, by the same argument as for type a processes, at most 2c(∆ − 1) + 1, and hence $t_3(p) \le t_1(p) + 2c(\Delta - 1) + 1 + c \le 2c\Delta(\Delta - 1) + 2c\Delta = 2c\Delta^2 \le 4c\Delta^2$.
Lemma 4.15. For a process p with depth(p) = k we have $t_3(p) \le 4c\Delta^2 k$.
Proof. We prove the lemma by induction on the depth of a process p = (r1, r2). By Corollary 4.13 there always exists a process of depth one in the conflict graph G and thus we may base the induction on this case.
Base Case: If depth(p) = 1, then $t_3(p) \le 4c\Delta^2$ by Lemma 4.14.
Induction: We henceforth assume that for a process q with depth(q) ≤ k − 1 it holds that $t_3(q) \le 4c\Delta^2 (k - 1)$ and consider a process p with depth k. By Lemma 4.11 all processes in p's delay graph D(p) which do not have r1 as their first register have depth less than k and thus, by the induction hypothesis, have finished their operations at time $4c\Delta^2 (k - 1)$ at the latest. Thus, the only processes in D(p) which are still active are those which have r1 as their first register, and consequently at time $4c\Delta^2 (k - 1)$ p's depth is at most one. We then apply Lemma 4.14 and get $t_3(p) \le 4c\Delta^2 (k - 1) + 4c\Delta^2 = 4c\Delta^2 k$.
Theorem 4.16. A process p finishes its operations after time $O(c\Delta^3 \log n / \log\log n)$ with probability at least $1 - 1/n^{\Delta}$.
Proof. By Corollary 4.10 the depth of any process is at most $3\Delta \log n / \log\log n$ with probability at least $1 - 1/n^{\Delta}$. Thus, using Lemma 4.15, $t_3(p) \le 4c\Delta^2 \cdot 3\Delta \log n / \log\log n \in O(c\Delta^3 \log n / \log\log n)$.
In a model where an operation takes exactly time c, clearly also an optimal algorithm needs at least ∆ · c time units (the contention times c). Therefore:
Corollary 4.17. With probability at least $1 - 1/n^{\Delta}$ the randomized registers algorithm is $O(\Delta^2 \log n / \log\log n)$-competitive.
EVALUATION
We have proposed a multi-lock algorithm, where the operation performed after all registers are locked can be defined arbitrarily by the programmer. To evaluate the algorithm, we chose the operation specifically to be a single-word compare-and-swap on each register. With this choice, our algorithm became a multi-word compare-and-swap algorithm.
The multi-word compare-and-swap operations (CASN) extend the single-word compare-and-swap operations from one word to many. A single-word compare-and-swap operation (CAS) takes as input three parameters: the address, an old value and a new value of a word, and atomically updates the contents of the word if its current value is the same as the old value (cf. Figure 5) . Similarly, an N-word compare-and-swap operation takes the addresses, old values and new values of N words, and if the current contents of these N words all are the same as the respective old values, the CASN will write the new values to the respective words atomically. Otherwise, we say that the CAS/CASN fails, leaving the variable values unchanged.
Figure 5: The single-word compare-and-swap primitive
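The semantics of CAS and CASN described above can be illustrated with a short Python sketch. It is a blocking emulation under stated assumptions and not the lock-free implementation evaluated in this section: a per-word `threading.Lock` stands in for hardware atomicity, the names `Word`, `cas` and `casn` are hypothetical, the words passed to `casn` are assumed to be distinct, and the lock order is an arbitrary fixed one (Python's `id`) rather than the random order of Section 4.

```python
import threading

class Word:
    """A shared word together with the lock that protects it."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

def cas(word, old, new):
    """Single-word compare-and-swap: write `new` only if the word holds `old`."""
    with word.lock:
        if word.value == old:
            word.value = new
            return True
        return False

def casn(updates):
    """N-word compare-and-swap over a list of (word, old, new) triples.
    All words are locked in one fixed order, compared, and written only
    if every word still holds its expected old value."""
    ordered = sorted(updates, key=lambda u: id(u[0]))
    for word, _, _ in ordered:
        word.lock.acquire()
    try:
        if all(word.value == old for word, old, _ in ordered):
            for word, _, new in ordered:
                word.value = new
            return True
        return False          # failure leaves every word unchanged
    finally:
        for word, _, _ in reversed(ordered):
            word.lock.release()

a, b = Word(1), Word(2)
assert cas(a, 1, 5) and a.value == 5
assert not cas(a, 1, 7) and a.value == 5
assert casn([(a, 5, 6), (b, 2, 3)]) and (a.value, b.value) == (6, 3)
assert not casn([(a, 6, 0), (b, 99, 0)]) and (a.value, b.value) == (6, 3)
```

The last assertion shows the all-or-nothing behavior: one mismatching word makes the whole CASN fail without modifying any word.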
The multi-word compare-and-swap operations are powerful constructs, which make the design of concurrent data structures more effective and easier. As expected, they have attracted the attention of many researchers, and consequently many CASN implementations appear in the literature [7, 12, 1, 10, 6, 5]. One approach suggested for constructing CASN operations is the cooperative technique, which allows processes to concurrently access the shared data as long as they write down what they are doing. Before changing a portion of the shared data that was locked by another process pj, a process pi must help pj complete its task first. The technique was first theoretically suggested by Barnes [3] and then was transformed into a more applicable one by Israeli et al. [7], which was used to implement a lock-free multi-word compare-and-swap operation. This implementation was later improved by Harris et al. [6] to reduce the per-word space overhead. A wait-free multi-word compare-and-swap was developed by Anderson based on this technique [1]. However, this cooperative technique uses a recursive helping policy, where a process has to help many other processes before completing its own task. The helping chains, where process pi helps pi+1, may be very long. All processes related to a chain may do the same task, the task of the last process in the chain, which reduces parallelism and creates high collision levels on the shared data needed by the common task.
In order to evaluate the performance of our algorithm (randomized CAS, in short RaCASN) and also check its feasibility in a real setting, we implemented it and ran it on a ccNUMA SGI Origin2000 multiprocessor that was equipped with 30 CPUs. As discussed in Section 2 we equipped our randomized registers algorithm with a helping policy [7]. In order to see in practice the performance benefits of the randomization we also implemented the deterministic recursive helping policy (DeCASN) presented in [7]. The implementation of RaCASN was similar to that of DeCASN except that the order of registers/words (see footnote 3) chosen to be locked was random in RaCASN. In other words, both algorithms are lock-free. For the tests we used a micro-benchmark and a small application. The micro-benchmark was designed to generate an execution environment with high contention on the shared registers. The application was a parallel-prefix application with continuous input feed and space constraints.
The micro-benchmark
The micro-benchmark aims at generating an environment with high contention on shared registers. In the micro-benchmark, a set of N + k virtual registers $v_i$, $1 \le i \le N + k$, are mapped on k system registers $r_1, r_2, \ldots, r_k$, where N is the number of registers to be updated atomically by the CASN operations. The mapping used is
The virtual registers are accessed by k N-word compare-and-swap operations $CASN_1, CASN_2, \ldots, CASN_k$. During the execution of this benchmark, each $CASN_i$ operation tries to update virtual registers $v_i, v_{i+1}, \ldots, v_{i+N-1}$ atomically. Note that two consecutive operations $CASN_i$ and $CASN_{i+1}$ have N − 1 system registers in common and thus the micro-benchmark can generate helping chains of length up to k, where a helping chain is a chain of CASN operations that a thread has to help before completing its own CASN.
In our experiment, we ran the micro-benchmark with DeCASN and RaCASN. In the RaCASN implementation, k random numbers corresponding to the k system registers were precomputed and stored in a shared array. For each execution, we measured the longest helping chain and then computed the distribution of the chain lengths over one million executions. We also measured the average execution time of the micro-benchmark using DeCASN and RaCASN. Our experiment ran the micro-benchmark with 28 threads on 28 processors of the SGI Origin2000 machine. We tested the benchmark with N = 2, 4, 6 and 8, i.e. CAS2, CAS4, CAS6 and CAS8. The results are presented in Figure 6 and Figure 7. Figure 6 shows that RaCASN breaks the helping chains much better than DeCASN, which makes it faster. Long helping chains degrade the efficiency of the whole system since all processors related to a chain try to lock the same registers of the last CASN in the chain, which generates high collision levels on these registers.
Results:
In the case of CAS2 in Figure 6 , RaCASN exhibits executions with the longest helping chain of length 4 in 61% of the total number of executions, of length 3 in 21% of the total number of executions and of length 5 in 16% of the total number of executions. The RaCASN longest helping chain over one million executions has length 7 in 0.2% of the total number of executions. Regarding DeCASN, it exhibits executions with longest helping chains of length 20 in 32% of the total number of executions, of length 21 in 16% of the total number of executions and of length 19 in 15% of the 3 Terms register and word can be used interchangeably. total number of executions. The DeCASN exhibits executions with longest helping chain of length 28, the maximal number of CASN operations, in 4% of the total number of executions.
When the number of registers to be updated increases, the distribution of RaCASN longest chain lengths shifts to the right slowly but is still much better than that of DeCASN, as shown in the charts of CAS4, CAS6 and CAS8 (cf. Figure 6). Note that the probability that one CASN must help another grows with N. However, the length of the longest helping chain may not increase since a successful CASN can reduce this length by at least N. We can observe this effect in Figure 6, where the highest bar in the DeCASN longest length distribution shifts to the left slowly when N increases from 2 to 8.
Since RaCASN helps the micro-benchmark break long helping chains, which by itself reduces collisions on memory and increases parallelism, RaCASN achieves better performance on the benchmark as shown in Figure 7. RaCASN is 20% to 31% faster than DeCASN. The overhead of computing k random numbers in the RaCASN implementation is not significant: it consumed only 0.07 percent of the execution time.
The application
As we have experienced, an algorithm that gains good performance on a micro-benchmark may not keep such performance on a real application. This motivated us to do another comparison between RaCASN and DeCASN on an application.
The application comes from the following problem:
The problem:
There are n registers $r_1, r_2, \ldots, r_n$, each of which belongs to one of n agents $a_1, a_2, \ldots, a_n$. The agents communicate with the underlying computational system via these registers: agent $a_k$ reads a result in register $r_k$ written by the system before writing there a new input $i_k$ for the system. Input values $i_k$ are put in register $r_k$ randomly and independently of other agents. The input values change all the time dynamically. (We can think of them as inputs from sensors.)
Figure 6: The distributions of the longest wait-queue lengths in the micro-benchmark on the SGI Origin2000 (horizontal axis: the longest length of wait-queues in one execution).
The computational system computes an output/result $o_k$ for agent $a_k$ from the prefix $i_1, i_2, \ldots, i_k$. For simplicity, we assume that it computes a prefix-sum $o_k = \sum_{j=1}^{k} i_j$.
The system writes the result $o_k$ back to register $r_k$ only if the values used to compute $o_k$ have not changed yet. That means:
• either all registers $r_1, r_2, \ldots, r_k$ have not changed yet if $o_k$ is computed from $i_1, i_2, \ldots, i_k$, or
• registers $r_{k-1}$ and $r_k$ have not changed yet if $o_k$ is computed from $o_{k-1}$ and $i_k$, where $o_{k-1}$ had been written successfully to register $r_{k-1}$ and no new input $i_{k-1}$ has been put in this register since $o_{k-1}$ was written back.
The efficiency of the computational system is evaluated by the number of results written successfully. The more results are written successfully, the better the system is.
A simple algorithm solving the problem:
The following two observations can be made:
• The results must be computed as fast as possible in order to write them back to the registers before new inputs are put in them.
• Using $o_{k-1}$ and $i_k$ to compute $o_k$ has a higher probability of success than using $i_1, i_2, \ldots, i_k$.
Therefore, we use n threads $t_1, t_2, \ldots, t_n$, where the main task of thread $t_k$ is to compute $o_k$ fast. The algorithm is illustrated in Figure 8.
Figure 8: The algorithm for a thread $t_k$ in computing one result/output
In our experiment, the CASN operation in the algorithm was in turn replaced by RaCASN and DeCASN and then the average execution times of the application were measured over one million executions. The number of registers or threads n was varied from 4 to 28. The experiment with higher n generates a higher collision level on the registers due to the helping policy. In the experiment, each thread ran exclusively on one processor of the Origin2000 machine. The result is presented in Figure 9.
Results:
The experimental result shows that RaCASN helps the application run faster compared to DeCASN. It is up to 40% faster in the case of 28 registers or 28 threads. Figure 9 shows that with more threads the DeCASN/RaCASN speed-up relation grows. This implies that the randomization in RaCASN plays a significant role in reducing collisions on the shared registers, thus helping the application achieve better performance. The overhead of pre-computing n random numbers corresponding to n registers is not significant: it takes at most 0.7% of the execution time. (This worst case is measured in the case where the number of registers is 4.)
Figure 9: The application execution times on the SGI Origin2000 (horizontal axis: the number of registers; vertical axis: execution time in microseconds; series: Original, Randomized).
CONCLUSIONS
In this paper we advocated randomization for implementing multi-locking such as CASN efficiently. We showed that our approach is efficient, in theory as well as in practice.
In the past, multi-lock algorithms were usually evaluated by random simulations. That is, in an evaluation/simulation of an algorithm it was assumed that randomly chosen registers were accessed by the processes. We believe that this is a conceptual faux pas. In fact, shared memory processes operate on shared data structures (e.g. search trees, linked lists) which are accessed anything but randomly. In reality, as in dining philosophers, access is not random but well-structured. For example, in a shared ordered linked list a process needs to multi-lock the two neighbor records in order to insert a new record.
By shifting the randomization from the simulation to the actual implementation, our system remains efficient in any application, even for worst-case access patterns.
