We present a new algorithm for implementing a multi-word compare-and-swap functionality supporting the Read and CASN operations. The algorithm is wait-free under reasonable assumptions on execution history and benefits from new techniques to resolve conflicts between operations by using greedy helping and grabbing. Although the deterministic scheme for enabling grabbing somewhat sacrifices fairness, the effects are insignificant in practice. Moreover, unlike most of the previous results, the CASN operation does not require the list of addresses to be sorted before the operation is invoked, and the Read operation can read the current value without applying helping when the word to be read is within an ongoing transaction. Experiments using micro-benchmarks varying important parameters in three dimensions have been performed on two multiprocessor platforms. The results show performance similar to that of the lock-free algorithm by Harris et al. for most scenarios, and significantly better performance in scenarios with very high contention. This is altogether extraordinarily good performance considering that the new algorithm is wait-free.
Introduction
In multi-threaded programming it is sometimes necessary to update a set of shared variables simultaneously. Examples are the design of dynamic concurrent data structures, where several pointers need to be updated at the same time, and scientific calculations on a graph structure, where several nodes need to be updated simultaneously. To do this correctly a transaction is needed, and as this is normally not directly supported by the system, special constructs are needed. A common approach to implementing multi-variable transactions is to employ mutual exclusion; the concurrent execution is serialized by the use of locks such that only one thread is updating the variables at any moment in time. However, locks inherently limit the achievable parallelism and therefore also significantly degrade the overall performance.
Moreover, mutual exclusion causes blocking and can consequently incur serious problems such as deadlocks, priority inversion or starvation. These problems are especially important for real-time systems, and efficient solutions only exist for uniprocessor systems [1]. Some researchers have addressed these problems by introducing non-blocking synchronization algorithms, which are not based on mutual exclusion. Lock-free algorithms are non-blocking, and guarantee that at least one operation can always progress, independently of the actions taken by the concurrent operations. Thus, lock-free algorithms can potentially incur starvation, although the other problems with mutual exclusion are avoided. Wait-free [2] algorithms are lock-free, and moreover guarantee that all operations can finish in a finite number of their own steps, regardless of the actions taken by the concurrent operations. In practice, often due to the extensive synchronization needed, wait-free algorithms are magnitudes more complex to design and offer significantly lower performance than their lock-free counterparts. Consequently, wait-free algorithms are normally of special interest to real-time systems, whereas the primary benefit of lock-free algorithms is performance.
The design of non-blocking algorithms benefits from the fact that the hardware of contemporary shared memory systems supports atomic transactions on a single memory word, e.g. single-word compare-and-swap (CAS). The CAS operation can conditionally update a memory word such that the new value is written only if the old value matches a given one. It has been shown by Herlihy [2] that this hardware primitive is universal and thus can implement any shared data structure in a non-blocking manner. Consequently, it is possible to construct a multi-word compare-and-swap operation (CASN) by the use of CAS and other hardware primitives. The CASN operation conditionally updates a set of memory words to a new set of values given that the words currently match a given set of values.
Several non-blocking algorithms suitable for CASN, as well as actual implementations of it, have appeared in the literature. From a historical perspective, the focus of these papers has shifted over time from the theoretical side to more practical implementations. Israeli and Rappoport [3] presented a disjoint-access-parallel algorithm. Shavit and Touitou [4] presented a similar algorithm and also generalized the concept into software transactional memory. Anderson and Moir [5] presented a wait-free algorithm. Attiya and Dagan [6, 7] and Afek et al. [8] presented algorithms focusing on aspects for general multi-object operations. Moir [9] presented a special purpose algorithm which is conditionally wait-free. These algorithms have in common that they use recursive helping techniques and large parts of the memory words for synchronization. Harris et al. [10] presented a lock-free algorithm that improved performance significantly and increased the usable part of the memory words by using separate descriptor structures for the synchronization information. Ha and Tsigas [11] presented a lock-free algorithm that focused on improving the handling of conflicts by only releasing the necessary number of words in a reactive manner. In this paper we will primarily focus on those results that are practical and generally applicable on contemporary systems.
This paper presents a new algorithm that implements a wait-free and linearizable multi-word compare-and-swap feature. The algorithm uses greedy helping techniques in order to limit the amount of recursive helping, benefits from descriptors to allow a larger part of the memory word to be usable, uses grabbing in a deterministic scheme to resolve conflicts, and allows the given list of addresses to be unsorted when calling the CASN operation.
The rest of the paper is organized as follows. Section 2 describes the system requirements and in Sect. 3, related work is discussed. In Sect. 4 the new algorithm is described. Section 5 shows proofs of correctness for the algorithm. In Sect. 6, some benchmark experiments are shown. Finally, Sect. 7 concludes this paper.
Hardware Synchronization
The shared memory system should support atomic read and write operations of single memory words, as well as stronger atomic primitives for synchronization. In this paper we use the Fetch-And-Add (FAA), Compare-And-Swap (CAS) and Swap (SWAP) atomic primitives; see Program 1 for a description of the intended semantics. These read-modify-write operations are available on most common architectures or can be easily derived from other synchronization primitives [12, 13].
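As a minimal sketch of the intended semantics of the three primitives, the following expresses them with C11 `<stdatomic.h>`. The function names `faa`, `cas` and `swap` and the type `word_t` are assumptions made for illustration; returning the previous contents from `faa` is a property of the C11 call and not something the text requires.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t word_t;   /* one machine word (assumed representation) */

/* FAA: atomically add `delta` to *addr; returns the previous contents. */
static word_t faa(_Atomic word_t *addr, word_t delta)
{
    return atomic_fetch_add(addr, delta);
}

/* CAS: store `new_val` only if *addr still holds `old_val`; reports success. */
static bool cas(_Atomic word_t *addr, word_t old_val, word_t new_val)
{
    /* On failure C11 writes the observed value into `old_val`; it is a
       by-value parameter here, so the caller never sees that update. */
    return atomic_compare_exchange_strong(addr, &old_val, new_val);
}

/* SWAP: unconditionally store `new_val`; returns the previous contents. */
static word_t swap(_Atomic word_t *addr, word_t new_val)
{
    return atomic_exchange(addr, new_val);
}
```

All three calls use the default sequentially consistent ordering, which is the simplest choice for a sketch; a tuned implementation might relax it.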
Related Work
Israeli and Rappoport [3] presented a lock-free and disjoint-access-parallel algorithm for implementing the Read and CASN operations. In this algorithm, each memory word is required to store both the value and the id of the thread that "owns" it (i.e., the thread currently involved in a transaction including that memory word). Consequently, the number of bits available for values is limited by the number of threads. Information about ongoing transactions as well as their status is stored in shared variables, with information stored for one transaction per thread. For ongoing transactions it is not possible to directly interpret the current value of a memory word, and hence in these situations a Read operation needs to "help" (i.e., execute) the concurrent CASN operations. The CASN operation itself also needs to interpret the current value of the involved memory words, and hence concurrent CASN operations are helped, which in turn might need to help other CASN operations and so on, in a recursive manner. As the algorithm is lock-free, helping might also be repeated indefinitely, due to concurrent threads always starting new transactions whenever the previous one has finished. When calling CASN, the argument with the addresses of the memory words involved in the transaction needs to be in ascending order. Moreover, the algorithm depends on the LL/VL/SC primitives, which are not provided in an ideal version by any contemporary or legacy hardware, and it thus depends on an additional software implementation of these primitives.
Anderson and Moir [5] presented a wait-free algorithm, dependent on wait-free LL/VL/SC primitives. In this algorithm, in contrast to [3], each memory word stores the value and is accompanied by an additional memory word storing the thread id and word index of the corresponding transaction. Moir [9] presented a lock-free and simplified version of [5]. However, the Read operation avoids helping completely and the algorithm for CASN applies non-redundant helping [4]. In non-redundant helping, if a CASN operation being helped is detected to be in conflict with another concurrent CASN, the CASN currently being helped is restarted (i.e., all work done is undone and the operation starts over) instead of performing recursive helping. Attiya and Dagan [6, 7], and later Afek et al. [8], improved on previous algorithms from a mainly theoretical perspective with a more general objective, and allow the parameter with the concerned addresses to be given in either ascending or descending order.
Harris et al. [10] presented a lock-free algorithm that improves performance significantly compared to [3], depends on the CAS hardware primitive, and increases the usable part of the memory words by occupying only two bits for denoting that the memory word is part of an ongoing transaction. Moreover, the information regarding ongoing transactions is represented per CASN operation invocation instead of per thread. These so-called descriptors are dynamically allocated, are of two different types due to the algorithm being designed in two layers, and must be handled by a separate memory reclamation scheme such as [14].
Ha and Tsigas [11] presented a lock-free algorithm that improves on [3] by reducing contention and recursive helping. It does this in a reactive manner by continuously measuring the current contention level on the concerned memory words. The actual reactive strategy is then, before starting to help concurrent CASN operations, to release the ownership of enough memory words to maximize the expected overall system-level performance from an economic analysis perspective. Still, the algorithm depends on the LL/VL/SC primitives.
Comparison with the New Algorithm
In resemblance to all previous results, the new algorithm performs the CASN operation in two major phases, i.e., locking and then unlocking of the involved memory words. In resemblance to [10], the new algorithm stores information in descriptor blocks that are dynamically allocated per operation invocation. However, unlike [10], locked memory words do not contain real memory addresses of the descriptors, but instead a number that uniquely identifies the related descriptor. In resemblance to all previous results, the new algorithm applies helping in order to guarantee progress. Moreover, like [4, 9] and [11], the amount of helping is reduced. In resemblance to [9], the new algorithm stores information in locked memory words about which index in the address list the memory word corresponds to, and thus avoids the search complexity of finding the corresponding index in descriptors. In resemblance to [9], the new algorithm avoids helping completely for Read operations. In resemblance to [10], the new algorithm depends on the CAS primitive. In resemblance to [3, 10, 11], and in contrast to [5, 9], the new algorithm occupies a part of the memory word for representing the status of being locked or not.
In contrast to all previous results, the new algorithm only resorts to helping after having checked the suitability of all other memory words besides the ones already locked by concurrent operations (i.e., if this operation will fail later anyway, there is no point in helping other operations). In contrast to all previous results, the new algorithm can directly lock memory words that are about to be unlocked by concurrent operations (as the corresponding operation has decided to fail or succeed), without the need for helping. In contrast to all previous results, the new algorithm can resolve conflicts without resorting to helping, by instead changing ownership (grabbing versus giving) of locked memory words in a predetermined order. In contrast to all previous results, the new algorithm does not require the list of addresses for the CASN operation to be sorted in any order. In contrast to [10], the new algorithm depends directly on the CAS primitive, without any need for another intermediate abstraction layer. In contrast to [10], the new algorithm only needs to occupy one bit of the memory word for locking status.

[...] newly defined operations, i.e., an additional software layer of abstraction. In this paper we are defining the Read and the CASN operations with the following semantics:
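A minimal sketch of these semantics, assuming the usual sequential specification of multi-word compare-and-swap: each function below describes the effect an operation should appear to have when it runs as one indivisible step. The names `read_spec` and `casn_spec` are hypothetical, and the code is a specification only, not a concurrent implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* Read: return the current contents of one memory word. */
static word_t read_spec(const word_t *address)
{
    return *address;
}

/* CASN: if every addr[i] holds old_val[i], install every new_val[i] and
   return true; otherwise leave all words untouched and return false. */
static bool casn_spec(size_t n, word_t *const addr[],
                      const word_t old_val[], const word_t new_val[])
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (*addr[i] != old_val[i])
            return false;            /* one mismatch: fail, change nothing */
    for (size_t i = 0; i < n; i++)
        *addr[i] = new_val[i];       /* all matched: update every word */
    return true;
}
```

Nothing in this specification depends on the order of the entries in `addr`, which is the property that lets callers pass an unsorted address list.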
In an abstract sense, the overall algorithm for performing the CASN operation using fine-grain synchronization is to:
1. Lock all the affected memory words.
2. Check the contents of the memory words and perform the conditional update.
3. Unlock all the affected memory words.
The information about a word's lock status is naturally stored within the memory word itself (an interesting alternative would be a hash table). To avoid unnecessary locking if the CASN should fail, the checking part of step 2 is done together with step 1, such that we only lock words with matching values. Moreover, in order to be wait-free and linearizable, we need to be able to: (i) perform step 2 in one atomic step, (ii) correctly read the current value even though words might be locked, and (iii) perform helping of operations in the middle of steps 1-3 while other operations are pending. To meet all of these requirements we use descriptors, see Fig. 1, that keep all necessary information about a CASN operation in progress, and locking is done by replacing the value with a pointer to the appropriate descriptor. As the descriptor keeps a single variable that indicates the status of the whole CASN operation, it is possible to update this atomically using CAS.
The status of the descriptor is stored in a single memory word and can be one of STATUS_TRYING seq, STATUS_GIVE descriptor, STATUS_FAILED, or STATUS_SUCCESS. The descriptor also keeps information about the involved addresses, requested old and new values, as well as information about the owning thread. By first checking the memory word and then checking the status (depending on which the interpreted value is either the old or new value) of the possibly corresponding descriptor, it is possible for other concurrent operations to read the current value of a memory word involved in a CASN operation that is in some intermediate step. In order to distinguish between the actual value and a reference to a descriptor one bit is used, leaving 31 bits on a 32-bit (or 63 bits on a 64-bit) memory word for the value.

Fig. 2 Example of a conflict between three threads (T1, T2, T3), where all available memory words have been locked and none of the threads can proceed further. This conflict can be resolved deterministically by changing the ownership of the locks appropriately
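The one-bit distinction between values and descriptor references can be sketched as follows. The exact layout is an assumption: the lowest bit is taken as the tag, and a locked word is assumed to pack a descriptor number together with the word's index in the address list, using a hypothetical `INDEX_BITS` of 8. All function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t word_t;

#define REF_TAG    ((word_t)1)          /* assumed: lowest bit marks a reference */
#define VALUE_MAX  (UINTPTR_MAX >> 1)   /* 31 or 63 usable value bits */
#define INDEX_BITS 8                    /* assumed width of the word index */
#define INDEX_MASK (((word_t)1 << INDEX_BITS) - 1)

static bool is_reference(word_t w) { return (w & REF_TAG) != 0; }

/* Encode a plain integer value (tag bit cleared). */
static word_t make_value(word_t value)
{
    assert(value <= VALUE_MAX);
    return value << 1;
}

/* Decode a plain integer value. */
static word_t get_value(word_t w)
{
    assert(!is_reference(w));
    return w >> 1;
}

/* Encode a locked word: descriptor number plus index in the address list. */
static word_t make_reference(word_t desc_id, word_t index)
{
    assert(index <= INDEX_MASK);
    assert(desc_id <= (VALUE_MAX >> INDEX_BITS));
    return (((desc_id << INDEX_BITS) | index) << 1) | REF_TAG;
}

static word_t ref_desc_id(word_t w)
{
    assert(is_reference(w));
    return w >> (INDEX_BITS + 1);
}

static word_t ref_index(word_t w)
{
    assert(is_reference(w));
    return (w >> 1) & INDEX_MASK;
}
```

Storing the index inside the locked word means a reader can go straight to the right old/new pair in the descriptor without searching its address list.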
The concurrent CASN operations will occasionally operate on the same memory words, as can be seen in Fig. 2, and conflicts in the locking step will eventually occur. The overall idea used in this paper is to first let the whole system of concurrent operations stabilize. This is done by helping the involved operations until no thread can lock any more words. After that, the conflicting words are identified and a deterministic rule decides which thread should grab the necessary (and already locked) words from the other thread. In this way, all conflicts will eventually be resolved, even when more than two threads are involved and block one another.
Algorithm Details
The following subsections give a detailed description of the operations supported by the presented multi-word compare-and-swap functionality. The main data structures are described in Program 2. In the pseudo-code, a set of pre-defined macros is used:
- A macro that creates a 32/64-bit encoded memory word that represents an integer value.
- GET_VALUE(value). Gets the integer value corresponding to the given 32/64-bit encoded memory word.
The algorithm relies on an underlying wait-free memory management implementation based on the memory allocation and reclamation algorithm presented in [15]. For the multi-word compare-and-swap implementation in this paper, per-thread local memory heaps are used instead of shared heaps for handling the dynamic memory management of descriptors. The reclamation procedure for ensuring the safety of reclaiming a descriptor after use consists of scanning the shared announce array (annReadAddr) as well as checking the corresponding address/old/new triples table for matches.
Wait-Free Reading
The algorithms of both the Read and CASN operations inevitably involve the subtask of interpreting the current value of a memory word, concurrently with other operations. Thus, in order for the operations to be wait-free, we need to be able to read the current value of a memory word in a wait-free manner, avoiding any (unlimited) repetition in case the memory word was concurrently updated. Moreover, as this reading is done extensively, due to being part of step 2 of the overall CASN algorithm, it should be done as efficiently as possible. Many of the previous results, e.g. [10, 11], need to help concurrent CASN operations while reading a memory word currently locked by an ongoing CASN invocation in order to correctly interpret the word's current value. Thanks to wait-free dereferencing and direct index-to-descriptor lookup, see Fig. 1, this algorithm can avoid that particular type of helping.
For the purpose of wait-free dereferencing we use the same idea as presented in [15], which is based on reference counting (thus adding the refcount field to the descriptor structure) to make sure that shared data structures like the descriptors can be safely accessed while the corresponding memory is concurrently being reclaimed. The idea stems from the fact that a memory word can store integers as well as pointers. As can be seen in Program 3, each thread has a set of shared variables used for announcing its intentions. The function WFRead returns the contents of the read memory word, and if it was a reference it also returns a pointer to the corresponding descriptor. Before actually reading the memory word in line 5, its address is stored in one of the slots of the annReadAddr array in line 4. If the value read is a reference, the address of the corresponding descriptor is fetched and its reference count increased in lines 6-8. In case there was a concurrent update of the read memory word after line 4, with the memory of the descriptor possibly being reclaimed for other use, the word in the annReadAddr array has also been concurrently changed before line 9. If so, the new value or descriptor reference is fetched in lines 9-15. This procedure works by forcing the concurrent threads to follow a certain rule whenever performing reclamation. Before reclaiming the memory of a descriptor (suitably after a CASN operation has finalized), all of the threads' annReadAddr variables have to be scanned to see if the announced addresses correspond to any of the addresses in the descriptor to be reclaimed. If so, the corresponding annReadAddr is updated with a fresh value or descriptor which corresponds to the current value of the memory word. Initially, all memory words need to be initialized with contents that can be properly interpreted as values.
Program 3
Procedure for performing wait-free reading of a memory word's contents.
1 word_t WFRead ( word_t * address , out descriptor_t * desc ) {
Greedy Helping
To enable scalable performance with an increasing number of threads, the algorithm needs to be as disjoint-access-parallel as possible, e.g., two CASN operations on disjoint sets of memory words should not synchronize. For the algorithm to be wait-free, every operation should terminate after executing a finite number of its own steps. In general, this implies the need for helping concurrent operations that are in some type of conflict, e.g., in a way such that all threads trying a CAS for a certain update will eventually succeed. On the other hand, for performance reasons, helping should be avoided as much as possible as it essentially means having multiple threads perform each thread's work.
A greedy approach would be to have each thread perform helping only if it is absolutely needed for that thread to progress. According to the defined semantics, the CASN operation can terminate with either failure or success. The decision whether to fail or succeed is taken at a certain step of the execution (i.e., when all words have been locked, or one mismatching value has been found), and after that decision has been taken the operation will eventually terminate with that result. Therefore, progress for a CASN operation can be seen as getting closer to making the decision. Consequently, when a CASN operation is in conflict with other CASN operations for a word, helping is absolutely needed only if no other steps (e.g. checking whether the values of the other words belonging to the same descriptor match or not) could be executed in order to get closer to the decision.
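The order of decisions in the greedy approach can be illustrated with a minimal sketch that works on a snapshot of the involved words. It is a simplification under stated assumptions: the lowest bit is assumed to mark a locked word, the names `greedy_scan` and `scan_result_t` are hypothetical, and the actual procedure additionally locks matching words with CAS as it goes and treats words owned by finalized descriptors as available.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* Assumed encoding: lowest bit set means "locked, refers to a descriptor". */
static bool   is_reference(word_t w) { return (w & 1) != 0; }
static word_t make_value(word_t v)   { return v << 1; }
static word_t get_value(word_t w)    { return w >> 1; }

typedef enum {
    SCAN_MISMATCH,  /* an unlocked word differs: fail without helping anyone */
    SCAN_CONFLICT,  /* only locked words stand in the way: helping is needed */
    SCAN_CLEAR      /* every word is unlocked and matches its old value */
} scan_result_t;

/* Check every word that can be checked before resorting to helping. */
static scan_result_t greedy_scan(size_t n, const word_t current[],
                                 const word_t old_val[])
{
    bool conflict = false;
    for (size_t i = 0; i < n; i++) {
        if (is_reference(current[i]))
            conflict = true;          /* postpone: keep checking the rest */
        else if (get_value(current[i]) != old_val[i])
            return SCAN_MISMATCH;     /* decision reached, no helping at all */
    }
    return conflict ? SCAN_CONFLICT : SCAN_CLEAR;
}
```

A mismatch found on any unlocked word settles the outcome immediately, so a conflict on some other word never has to be resolved in that case.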
Observation 1 The normally expected execution behavior of concurrent CASN operations, is that if one CASN succeeds it typically means that all concurrent CASN operations that are in conflict will terminate with failure (as the updated memory words no longer match the expected values).
Thus, assume that we could achieve a conflict resolution scheme such that if at least one thread executes its steps, one of the conflicting CASN operations will eventually succeed. Then all other conflicting threads will also eventually terminate and in total the operations will be wait-free. However, there are unlikely but still possible scenarios in which one or more threads starve, e.g. one thread executing a series of successful CAS(x,0,0), CAS(x,0,0), … concurrently with another thread issuing CAS(x,0,1) which unfortunately never wins the conflict. As this kind of scenario is unlikely in itself and even more unlikely to repeat infinitely, we define the following assumption:
Assumption 1 When used in the system, the successful CASN operations always update the memory words to new values that are different from the old ones, or at least there is a limited series of updates to the same values in the concurrent execution history.
The procedure HelpCompareAndSwap is used by both the issuing CASN operation and by operations that need to help that CASN in order to continue with their own operation. The underlying HelpCompareAndSwap procedure is described in Programs 4, 5, and 6. Note that in order to simplify the presentation, details regarding the memory reclamation have been omitted. Hence, appropriate lines, incrementing or decrementing (using FAA) the reference count (refcount) of accessed descriptors, need to be inserted where necessary. In order to avoid recursive helping of the same threads, the HelpCompareAndSwap procedure keeps a local set data structure, avoidList, that keeps track of which threads are already in the recursive call-chain of the current HelpCompareAndSwap procedure. The HelpCompareAndSwap procedure starts with the status of STATUS_TRYING and keeps on trying to lock all the memory words until either it is not possible to proceed further with this descriptor (due to conflicts which could not be resolved without helping some thread in the avoidList) or the status of the descriptor has finalized to either STATUS_SUCCESS or STATUS_FAILED.
While the status is STATUS_TRYING seq , the procedure first tries to lock (using CAS) all words with the correct value (matching with the requested old) that are either free or belonging to a descriptor which has finalized. If a word is found having the wrong value, the status is updated (with CAS) to STATUS_FAILED. If no more words can be locked and there are still words remaining before success, the procedure now tries to resolve the conflict as follows:
- If the thread that owns the conflicting word belongs to the avoidList, this word is skipped (which means that this call of the procedure cannot decide the final status of the descriptor) and the procedure continues with the next conflicting word.
- If the helped thread has a lower id than the conflicting thread, the procedure tries to start grabbing by trying to update (with CAS) the status of the conflicting descriptor from STATUS_TRYING seq2 to STATUS_GIVE helped descriptor. If this succeeds it then grabs (with CAS) all of the words needed from the conflicting thread and then updates the status of the conflicting descriptor to STATUS_TRYING seq2+1 (the increased sequence number is required for making the conflicting thread aware that words have been lost). Otherwise it recursively calls the HelpCompareAndSwap procedure for the conflicting descriptor with the currently helped thread added to the avoidList.
- If the helped thread has a higher id than the conflicting thread, the procedure calls the HelpCompareAndSwap procedure for the conflicting descriptor with the currently helped thread added to the avoidList. It will then try to change the status of the helped descriptor and possibly give away the required words in the same manner as in the above case (although with the roles of the thread ids and descriptors reversed).
While the status is STATUS_GIVE descriptor2, the procedure will give (with CAS) all of the words needed by the conflicting thread (given by descriptor2) that are currently locked by the helped thread. It will then update the status of the helped descriptor to STATUS_TRYING seq+1 (where STATUS_TRYING seq was the status before it became STATUS_GIVE).
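The status transitions used for grabbing can be sketched as two CAS steps on the status word of the conflicting descriptor. The encoding is an assumption made for illustration (two low bits for the kind, the remaining bits for the sequence number or the descriptor number), and the names `status_make`, `should_grab`, `begin_grab` and `end_grab` are hypothetical; moving the individual words between descriptors is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t word_t;

enum { ST_TRYING = 0, ST_GIVE = 1, ST_FAILED = 2, ST_SUCCESS = 3 };

/* Assumed layout: kind in the two low bits, payload (sequence number for
   TRYING, grabbing descriptor number for GIVE) in the remaining bits. */
static word_t status_make(word_t kind, word_t payload) { return (payload << 2) | kind; }
static word_t status_kind(word_t s)    { return s & 3; }
static word_t status_payload(word_t s) { return s >> 2; }

/* Deterministic rule: the operation of the lower thread id grabs. */
static bool should_grab(unsigned helped_tid, unsigned conflicting_tid)
{
    return helped_tid < conflicting_tid;
}

/* STATUS_TRYING seq -> STATUS_GIVE grabber; fails if the status has moved on. */
static bool begin_grab(_Atomic word_t *status, word_t seq, word_t grabber_desc)
{
    word_t expected = status_make(ST_TRYING, seq);
    return atomic_compare_exchange_strong(status, &expected,
                                          status_make(ST_GIVE, grabber_desc));
}

/* STATUS_GIVE grabber -> STATUS_TRYING seq+1, once the words have moved. */
static bool end_grab(_Atomic word_t *status, word_t seq, word_t grabber_desc)
{
    word_t expected = status_make(ST_GIVE, grabber_desc);
    return atomic_compare_exchange_strong(status, &expected,
                                          status_make(ST_TRYING, seq + 1));
}
```

Because the sequence number is part of the compared word, a thread that still believes the descriptor is in STATUS_TRYING seq fails its next CAS on the status and thereby notices that words have been taken from it.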
If the status becomes either STATUS_SUCCESS or STATUS_FAILED, the procedure will perform a clean-up on all locked memory words, updating them (with CAS) to pure values and thus removing the reference to the descriptor, and then terminate.
Read and CASN Operations
Program 7 The Read operation.

The design of the Read operation follows directly from the use of the WFRead function. The Read operation is described in Program 7. If the content at the address was a value, the significant bits are returned. Otherwise, the word index (according to the CASN operation) is extracted, and depending on the status of the descriptor, either the requested old or new value at the corresponding index in the descriptor is returned.

The CASN operation is described in Program 8. The design of the CASN operation is simply to initialize a new descriptor with the arguments; the descriptor needs to be dynamically allocated and reclaimed in a wait-free manner [15], from either a shared or local memory pool. The HelpCompareAndSwap procedure is then called on the descriptor together with an empty avoidList (such that all conflicts will be resolved in this call). Depending on the resulting status of the descriptor the operation returns either true or false.
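The value interpretation performed by the Read operation can be sketched as follows. This is a single-threaded illustration under several assumptions: the announcement and reference counting of WFRead are left out, the status is a plain field rather than an atomically read word, and the word layout (tag in the lowest bit, a hypothetical `INDEX_BITS` of 8, descriptors found through a lookup table) is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

enum { ST_TRYING, ST_GIVE, ST_FAILED, ST_SUCCESS };  /* simplified status kinds */

typedef struct {
    int           status;    /* one of the ST_* kinds above */
    size_t        n;         /* number of words in the operation */
    const word_t *old_val;   /* requested old values, by index */
    const word_t *new_val;   /* requested new values, by index */
} descriptor_t;

#define INDEX_BITS 8         /* assumed width of the word index */

static bool   is_reference(word_t w) { return (w & 1) != 0; }
static word_t make_value(word_t v)   { return v << 1; }
static word_t make_reference(word_t desc_id, word_t index)
{
    return (((desc_id << INDEX_BITS) | index) << 1) | 1;
}

/* Interpret the contents `w` of a memory word. */
static word_t interpret(word_t w, const descriptor_t *const table[])
{
    if (!is_reference(w))
        return w >> 1;                                /* plain value */
    const descriptor_t *d = table[w >> (INDEX_BITS + 1)];
    size_t index = (w >> 1) & (((word_t)1 << INDEX_BITS) - 1);
    assert(d != NULL && index < d->n);
    /* Only a successful operation makes the new value visible. */
    return d->status == ST_SUCCESS ? d->new_val[index] : d->old_val[index];
}
```

A locked word therefore reads as its old value until the owning operation succeeds, at which point it reads as the new value even before the word itself has been cleaned up.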
Algorithm Extensions
As the conflicts are deterministically resolved by using the threads' ids, the algorithm is not fair. Although this is no problem in practice, it is possible to improve the fairness. Either the thread ids can be cycled in a round-robin manner for every CASN operation issued, or the deterministic conflict resolution ordering can be changed dynamically at run time (however, the order needs to avoid the possibility of bouncing).
The limitation of not being able to use all bits of the memory word for values, can be overcome by replacing the value bits with a pointer to a separate memory block (which is dynamically allocated and reclaimed by a separate scheme) of arbitrary size; an interesting technique also used in [16] .
The algorithm can be made purely wait-free (i.e., avoiding Assumption 1) by allowing a CASN operation that has failed to lock a word, or has been forced to release one, to then announce its operation. Other threads are then forced to initially check this announcement and help accordingly.
Correctness
In this section we show the wait-free and linearizability properties of the algorithm.
Wait-Freedom

Definition 1 In order for an implementation of an operation to be wait-free [2], every operation invocation should terminate within a finite number of its own execution steps, regardless of the interleaving and actions performed by concurrent operations.
Lemma 1 The Read operation is wait-free.
Proof As the algorithm steps of the used WFRead (see Program 3) and the remaining steps of the Read operation (Program 7, lines 4-8) contain no loops, it follows that the operation must terminate in a finite number of executed steps.
Lemma 2 The CASN operation is wait-free.
Proof Firstly, all reads of memory words are clearly wait-free, as WFRead is used. The only unbounded loop is the main loop (Program 4, lines 4-93) of the descriptor status, which will terminate if either all words are successfully locked (lines 13-20 and 80-86) or there was a word not matching the requested old value (lines 27-29, 66-67, and 73-79). It will retry if there was a concurrent update to any of the words it tried to lock (failed CAS in lines 31 or 69) or if the status is changed (line 6). If there are no conflicts with concurrent CASN operations, there will not be any concurrent updates of the corresponding memory words, and the operation will consequently eventually terminate. If there was a conflict, it is either the case that the conflicting words are grabbed or given. If they are grabbed, the operation will clearly eventually terminate. If they are given, the conflicting operation will be helped. The helping will terminate in a finite number of steps as any further recursive helping is limited by the number of threads and subsequent conflicts will be resolved deterministically. The helped operation will succeed and according to our assumption, the words are updated with new values of which at least one is not matching the requested old values of this operation, and consequently this operation will eventually terminate.
Linearizability
Definition 2 The value of a memory word is interpreted depending on its contents as follows. If it is marked as a value, the interpreted value is the remaining integer bits. Otherwise, if the status of the corresponding descriptor is success, the interpreted value is the requested new value of the corresponding address, and otherwise it is the requested old value.
Lemma 3 The Read operation takes effect atomically.
Proof If the read content of the memory word was marked as a value, the operation takes effect at the read sub-operation (Program 3, line 5). If the content was marked as a reference, there might have been a concurrent update of the descriptor's status before the status was read. If there was a concurrent update of the status to success, the operation takes effect at the concurrent CAS sub-operation in line 17 of Program 4 (at this point, the memory word must still be locked and consequently follows Definition 2). Otherwise, the operation takes effect at the read sub-operation of the memory word (Program 3, line 5), as the contents of the memory word must have been the descriptor's requested old value (as the memory word was successfully locked by the concurrent operation).
Lemma 4 The CASN operation takes effect atomically.
Proof A CASN operation that succeeds also updates the corresponding values. Consequently, it should take effect when the update takes effect according to Definition 2, which is when the descriptor's status is changed to success with the CAS sub-operation in line 17 of Program 4. When a CASN operation fails, there is one memory word not matching the requested old values. Consequently, it should take effect when the non-matching word was read in the same manner as how the Read operation takes effect, i.e., the read sub-operation in line 5 of Program 3.
Definition 3 In order for an implementation of a shared concurrent data object to be linearizable [17], for every concurrent execution there should exist an equal (in the sense of the effect) and valid (i.e., it should respect the semantics of the shared data object) sequential execution that respects the partial order of the operations in the concurrent execution.
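To make the definition concrete, the following minimal Python sketch checks a small recorded history of Read and CASN operations by brute force: it searches for a sequential order that respects the real-time partial order and reproduces every recorded result. The history representation (dictionaries with `start`, `end`, and `result` fields) and the function names are assumptions for illustration only; the search is exponential and only suitable for very small histories.

```python
from itertools import permutations


def apply_op(op, mem):
    """Execute one operation sequentially on `mem` and return its result."""
    if op["kind"] == "read":
        return mem[op["addr"]]
    # CASN: op["words"] is a list of (address, old value, new value) triples.
    if all(mem[a] == old for a, old, _ in op["words"]):
        for a, _, new in op["words"]:
            mem[a] = new
        return True
    return False


def is_linearizable(history, initial):
    """Return True if some valid sequential order explains the history."""
    for order in permutations(history):
        pos = {id(op): i for i, op in enumerate(order)}
        # Partial order: if a finished before b started, a must precede b.
        violates = any(
            a["end"] < b["start"] and pos[id(a)] > pos[id(b)]
            for a in history
            for b in history
        )
        if violates:
            continue
        mem = dict(initial)
        # Valid and equal: every operation returns what was recorded.
        if all(apply_op(op, mem) == op["result"] for op in order):
            return True
    return False
```

For example, a Read that overlaps a successful CASN may return either the old or the new value, whereas a Read that starts after the CASN has returned must observe the new value.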
Theorem 1 The algorithm implements a wait-free (under given assumptions about execution history) and linearizable multi-word compare-and-swap functionality.
Proof Following Lemmas 1 and 2 given Assumption 1, the algorithm is wait-free. According to Lemmas 3 and 4, the Read and CASN operations take effect atomically at one statement that is executed within their invocations. Consequently, these operations are linearizable according to the given definitions.
Experiments
We have conducted a number of experiments to examine the behavior of the algorithm on contemporary multiprocessors with respect to the number of words, the number of threads, and the level of contention. The experiments are performed in the form of a micro-benchmark. In this benchmark, each thread repeatedly tries to atomically increment a randomized set of words (for every repetition, determined and sorted off-line before starting each benchmark so that all implementations can use the very same random patterns) by first reading them using the Read operation and then updating them using the CASN operation. For checking correctness, in every repetition the result of the CASN is noted down in local memory, and the whole history
The results show that, in most scenarios, the new wait-free algorithm performs similarly to the lock-free algorithm by Harris et al., although it performs significantly worse in some scenarios with medium contention. However, in scenarios with maximum contention (the number of words equals the number of memory words) it performs significantly better in the majority of the experiments. Moreover, it seems in practice that fairness is not worse than for the other algorithms compared. The advantage under high contention is likely due to the efficient conflict resolution by grabbing and the greedy helping, which enables operations to steal memory words of other, just recently terminated, operations without having to wait for them to clean up.
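The structure of the micro-benchmark loop can be sketched in Python as follows. This is a sketch under stated assumptions: `LockedCASN` is a lock-based stand-in with the same Read/CASN interface, not the wait-free algorithm of this paper, and all names, parameters, and the final per-word count check are hypothetical. The check shown is one plausible use of the recorded CASN results, not necessarily the one used in our experiments.

```python
import random
import threading


class LockedCASN:
    """Lock-based stand-in offering Read and CASN (NOT the wait-free algorithm)."""

    def __init__(self, n_words):
        self._words = [0] * n_words
        self._lock = threading.Lock()

    def read(self, addr):
        with self._lock:
            return self._words[addr]

    def casn(self, addrs, olds, news):
        with self._lock:
            if any(self._words[a] != o for a, o in zip(addrs, olds)):
                return False
            for a, n in zip(addrs, news):
                self._words[a] = n
            return True


def make_patterns(n_threads, n_ops, n_words, words_per_op, seed=0):
    """Determine (and sort) the random word sets off-line, before the run."""
    rng = random.Random(seed)
    return [
        [sorted(rng.sample(range(n_words), words_per_op)) for _ in range(n_ops)]
        for _ in range(n_threads)
    ]


def worker(mem, pattern, history):
    for addrs in pattern:
        olds = [mem.read(a) for a in addrs]      # read each word with Read
        news = [v + 1 for v in olds]             # attempt an atomic increment
        history.append(mem.casn(addrs, olds, news))  # note the result locally


def run_benchmark(n_threads=4, n_ops=200, n_words=8, words_per_op=2, seed=0):
    mem = LockedCASN(n_words)
    patterns = make_patterns(n_threads, n_ops, n_words, words_per_op, seed)
    histories = [[] for _ in range(n_threads)]
    threads = [
        threading.Thread(target=worker, args=(mem, p, h))
        for p, h in zip(patterns, histories)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem, patterns, histories


def expected_counts(n_words, patterns, histories):
    """Each successful CASN increments every word in its set exactly once."""
    counts = [0] * n_words
    for pattern, history in zip(patterns, histories):
        for addrs, ok in zip(pattern, history):
            if ok:
                for a in addrs:
                    counts[a] += 1
    return counts
```

Because the patterns are generated from a fixed seed before the threads start, every implementation under test can be driven by exactly the same sequence of word sets.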
To the best of our knowledge, the only previous experimental studies of CASN algorithms in the literature have been done by Harris et al. [10], and by Ha and Tsigas [11]. The algorithms compared, besides the ones presented in the respective publications [10, 11], are the algorithms by Israeli and Rappoport [3] and Moir [9], as they were explicitly expressed to be the only practical alternatives. In [10], on systems similar to one used in this study (i.e., the IBM platform), the algorithm by Harris et al. [10] performed from 7 up to 21 times faster than the algorithm by Israeli and Rappoport [3], for N between 2-8 memory words and with a contention level ranging between 256-4096 memory words. The advantage of [10] compared to [3] decreases with increasing N and higher contention. In [11], on a MIPS-based SGI system (which has significant similarities to the IBM platform), the algorithm by Ha and Tsigas [11] performed from equally fast up to 9 times faster than the algorithms by Israeli and Rappoport [3] and Moir [9], for N between 2-8 memory words and with a contention level ranging between 8-16,384 memory words. The algorithms by Israeli and Rappoport [3] and Moir [9] showed similar performance throughout the study. The advantage of [11] compared to [3] and [9] decreases with increasing N and higher contention. Using the information from these two studies together with the experiments performed in this paper, it could hence be concluded that the new algorithm would outperform [3] and [9] not only under low contention but also under high contention.
Conclusions
We have presented a new algorithm for implementing a multi-word compare-and-swap functionality supporting the Read and CASN operations. The algorithm is wait-free under reasonable assumptions on execution history and benefits from new techniques to resolve conflicts between operations by using greedy helping and grabbing. Moreover, unlike most of the previous results, the CASN operation does not require the list of addresses to be sorted prior to the call. Experiments have been conducted on two multiprocessor platforms. Results show performance similar to that of the algorithm by Harris et al. for most scenarios, and significantly better performance in scenarios with very high contention. We believe that our implementation should be of high practical interest for contemporary and emerging multi-core and multi-processor systems thanks to both its high performance and its strong progress guarantees. We are currently incorporating it into the NOBLE [18] library.
Interesting future work includes investigating the usefulness of the new algorithm for implementing dynamic wait-free data structures, improving fairness while preserving the overall performance, and improving the performance of the underlying memory management.
