Abstract: The vector is a fundamental data structure, which provides constant-time access to a dynamically-resizable range of elements. Currently, there exist no wait-free vectors. The only non-blocking version supports only a subset of the sequential vector API and exhibits significant synchronization overhead caused by supporting opposing operations. Since many applications operate in phases of execution, where only a subset of operations is used in each phase, this overhead is unnecessary for the majority of the application. To address the limitations of the non-blocking version, we present a new design that is wait-free, supports more of the operations provided by the sequential vector, and provides alternative implementations of key operations. These alternatives allow the developer to balance the performance and functionality of the vector as requirements change throughout execution. Compared to the known non-blocking version and the concurrent vector found in Intel's TBB library, our design outperforms or provides comparable performance in the majority of tested scenarios. Over all tested scenarios, the presented design performs an average of 4.97 times more operations per second than the non-blocking vector and 1.54 times more than the TBB vector. In a scenario designed to simulate the filling of a vector, performance improvement increases to 13.38 and 1.16 times. This work presents the first ABA-free non-blocking vector. Unlike the other non-blocking approach, all operations are wait-free and bounds-checked and elements are stored contiguously in memory.
issue by using an array segment model to store elements, but this causes elements to be scattered across several arrays.
In this work, we present the design of a concurrent wait-free vector (WFvec) which uses a new methodology to prevent thread starvation. Our design incorporates a variety of function models which allow the vector to gain higher performance in scenarios where only a subset of vector operations is needed. We present a new resize algorithm which facilitates concurrent access and modification during resize, allowing the use of a contiguous memory model. The contiguous model is advantageous as it allows for finer control over the size of the vector and the amount by which it increases its capacity.
We present an abstraction over the underlying memory model which allows the presented vector operations to function with either the segmented or contiguous models. Our design supports random access read (at), write (write), insert (insertAt) and erase (eraseAt) operations as well as appending to (pushBack) and removing from (popBack) the tail of the vector. Unlike a sequential vector, a concurrent vector must maintain correctness when multiple threads are performing pushBack and popBack operations. Achieving this correctness adversely affects the complexity and performance of these two operations. As a result, other known concurrent vectors either sacrifice functionality or safety guarantees to achieve desired performance.
The pushBack and popBack operations are not typically executed concurrently, but rather are used in different phases of an application's execution. For example, a common use-case is to grow a vector by appending (pushBack) values in one phase of execution, and, in the second, processing the values by removing (popBack) or randomly accessing the values. Currently, applications are not able to take advantage of such use patterns and must bear the cost of supporting both operations through the entire execution, instead of just the portions where they are necessary.
1. Strong scaling is the scenario in which the total problem size stays fixed while the number of processing elements is increased. The challenge is to synchronize the work of the processing elements in a correct and efficient manner without "wasting" too many cycles on overhead.
To the best of our knowledge, no other work has proposed the use of different function models based on how functionality requirements change throughout an application. In addition to the traditional model, which allows each vector operation to be executed alongside one another, we provide two alternative models for the pushBack and popBack operations. These alternative models are designed to take advantage of cases where only one of these two operations is executing. We refer to these two alternative models as the compare-and-swap (cas) model and the fetch-and-add (faa) model. In contrast to the cas model, where threads compete to append a value or remove the last value, the faa model assigns a position to each thread to perform its operation. Section 6.3 discusses our methodology in designing these models and their compatibility with other vector operations.
A developer would select one model at the start of the application and switch to another model later on. Because the in-memory representation of our vector is independent of these models, there is no cost or limit to the number of times a user can switch between models. Compared to the traditional model, the cas model and the faa model perform 6.48 and 11.95 times as many operations per second, respectively.
This approach in concurrency is inspired by the component aspect of the generative programming paradigm. In this paradigm, an optimized software product is automatically generated from a set of reusable implementation components based on a set of requirements [6]. A generative optimizer would be able to automatically select when to use each model based on its understanding of the application and user pragmas. However, such an optimizer is beyond the scope of this work, and the presented design necessitates that the developer explicitly selects which model to use.
We are aware of only one other non-blocking vector [5], which supports random access read and write, and both tail operations popBack and pushBack. Additionally, Intel's Threading Building Blocks (TBB) library [4] provides a fine-grained locking vector that supports random access read and write and only the pushBack tail operation.
In contrast to the presented vector, neither supports insertAt or eraseAt. Further, the non-blocking vector does not support bounds-checking and does not guard against the ABA problem [7].
The specific contributions of this work are as follows:
• The first vector to provide a wait-free progress guarantee for all operations. This strong progress property makes it applicable for both real-time and many-core systems. Our design imposes few restrictions on the type of elements stored in the vector (Section 4).
• The concept of using different function models throughout an application's execution to balance performance and functionality.
  - We present three versions of the vector's tail operations, each with varying degrees of performance and support for other vector operations.
  - We discuss how a developer can select a high performing model at the start of the application and switch to a model with more functionality as the use case of the vector changes. This narrows the effect on an application's performance caused by supporting a wide range of operations to the portion of execution that requires those operations.
• A wait-free copy-over algorithm that allows elements to be stored contiguously on a single array. Both the non-blocking vector and TBB's concurrent vector use a series of array segments which break the contiguous property.
• The first non-blocking vector to support bounds checking. This prevents a thread from removing a value from an empty vector or performing a random access operation on a position that is not within bounds. Designs which do not support bounds checking exhibit undefined behavior in these scenarios.
• The first non-blocking vector to support multi-position operations, such as insertAt and eraseAt. The design of these operations can be used to build other multi-position operations as well. One practical multi-position operation we propose is a map operation to atomically update each value in the vector.
• A novel descriptor-based recursive helping technique which allows threads to help complete operations which may span an arbitrary number of words across several memory addresses.
• In a micro-benchmark, the presented design achieves higher performance than the best available concurrent vectors.
  - On average, in scenarios involving only pushBack operations, our design improves performance by a factor of 7.27.
  - On average, in scenarios involving pushBack and random access read operations, our design improves performance by a factor of 4.4.
  - Averaging all test scenarios, our design improves performance by a factor of 3.26.
BACKGROUND
Algorithms designed for concurrent execution are susceptible to many dangers. These dangers include: dead-lock [1], live-lock [8], and thread starvation [1].
Dead-lock. A situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does [1].
Live-lock. Similar to a dead-lock, except that the states of the processes involved in the live-lock constantly change with regard to one another, but none are progressing [8].
Starvation. The case where some threads are making progress, but one or more threads are perpetually blocked from making progress [1].
Non-blocking algorithm designs avoid mutual exclusion in an attempt to provide scalable performance and to prevent such dangers. If an algorithm is wait-free, then it is free from all three of the aforementioned dangers. However, if it is obstruction-free or lock-free it is susceptible to starvation, and if it is obstruction-free, it may also become live-locked [1].
The design of non-blocking algorithms is predominantly based on the compare-and-swap operation. This operation atomically compares the contents of a memory address with an expected value, and if they match, it assigns a new value to that address before returning the read value. This operation enables a developer to reason about the contents of a memory address before and after an operation. However, it also introduces the ABA problem, which may lead a thread to act incorrectly. ABA is not an acronym, but a description of an error that can occur in a concurrent environment. For example, a thread operating on an address, P, may act incorrectly if the value of P transitions from a value, A, to another value, B, and back to the original value, A, without the thread perceiving P to have held the value B [1]. Section 6 provides a specific example of how ABA can occur and how our design prevents it.
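The ABA interleaving described above can be replayed deterministically. The following is a minimal sketch, not part of the presented design: it uses the C++11 `std::atomic` form of compare-and-swap (which reports success as a boolean rather than returning the read value), and the names `cas` and `abaGoesUnnoticed` are hypothetical. The two "threads" are simulated by sequential steps so that the outcome does not depend on scheduling.

```cpp
#include <atomic>
#include <cassert>

// Returns true if `addr` held `expected` and was updated to `desired`.
// compare_exchange_strong is the C++11 form of compare-and-swap.
inline bool cas(std::atomic<int>& addr, int expected, int desired) {
    return addr.compare_exchange_strong(expected, desired);
}

// Replays the ABA interleaving in a single thread so it is deterministic.
// Returns true if the stale CAS succeeded, i.e. the A -> B -> A transition
// went unnoticed by the delayed "thread".
inline bool abaGoesUnnoticed() {
    const int A = 1, B = 2, C = 3;
    std::atomic<int> P(A);

    int observed = P.load();   // "thread 1" reads A, then is delayed
    bool s1 = cas(P, A, B);    // "thread 2" changes A -> B
    bool s2 = cas(P, B, A);    // "thread 2" changes B -> A
    assert(s1 && s2);

    // "thread 1" resumes: its CAS succeeds although P changed twice.
    return cas(P, observed, C);
}
```

Because the comparison only inspects the current contents of P, the final compare-and-swap cannot tell that P held B in the meantime.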
Linearizability is a correctness condition that allows a developer to reason about the correctness of a concurrent object. An operation is linearizable if it appears "to take effect instantaneously at some moment between its invocation and response" [1] . It allows for the construction of a valid sequential history from an initial state to the current state given a set of concurrent operations. This sequential history is constructed by ordering all operations by their invocation and response events. If two operations have overlapping events, then they are ordered by the point at which they linearized with respect to each other.
A vector is a sequence container which stores elements contiguously in memory. This allows for elements to be efficiently accessed using indices. New elements can either be appended to the end of the vector with $O(1)$ amortized complexity or inserted at an arbitrary position with $O(N)$ amortized complexity. Likewise, elements can be removed from the end of the vector with $O(1)$ amortized complexity and erased from an arbitrary position with $O(N)$ amortized complexity. If the number of elements exceeds the capacity of a vector, a new region of memory is allocated and the elements are copied over.
RELATED WORK
We are aware of only one other non-blocking concurrent vector in the literature: a lock-free vector (LFvec) that supports concurrent read, write, pushBack, and popBack operations [5]. In this design, a single shared object is used to serialize popBack and pushBack operations. To complete either operation, a thread must first acquire the shared object. If this object is already owned, the thread must execute a helping procedure. In contrast to the presented approach, this design cannot guarantee that a tail operation will complete if new operations are continually added to the system. Random access read and write operations are able to execute concurrently without acquiring this shared object; however, there is no mechanism to prevent a thread from accessing a position that is not in bounds. This can lead to the case where a thread reads or writes to a position that is out of bounds, resulting in undefined behavior. This lock-free approach puts the burden of bounds checking on the user, and it is unclear how a user can safely perform bounds checking. For example, if a thread were to check the size of the vector, another thread can immediately pop one or more elements. The first thread would be unaware of this and could access an out of bounds position. Further, this design is prone to the ABA problem [5, Section 3.5], which may lead to elements not being stored contiguously.
Outside of the literature, Intel's Threading Building Blocks library presents an open source concurrent vector that supports a subset of the operations provided by the C++ Standard Library [4]. This vector supports semi-bounds checked random access operations and the pushBack operation, but does not support the popBack operation. pushBack is performed by fetching and adding one to the size variable and writing the value at the fetched position. In contrast to the presented approach, their methodology does not provide a mechanism to distinguish between a position that holds a value and a position that has been assigned but not written to, allowing for the case where a thread may operate on obsolete data or a partially written value.
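The fetch-and-add style of append described above can be illustrated with a minimal sketch. This is not TBB's implementation; `FaaBuffer`, `reserve`, and `publish` are hypothetical names, and the buffer has a fixed capacity to keep the example short. The append is split into its two steps to expose the window in which a position has been assigned but not yet written.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity buffer; all names are illustrative only.
struct FaaBuffer {
    static constexpr std::size_t kCapacity = 8;
    std::atomic<std::size_t> size;
    int slots[kCapacity];

    FaaBuffer() : size(0) {
        for (std::size_t i = 0; i < kCapacity; ++i) slots[i] = 0;
    }

    // Step 1: reserve a position by atomically incrementing size.
    // fetch_add returns the value held before the increment.
    std::size_t reserve() { return size.fetch_add(1); }

    // Step 2: write the value at the reserved position.
    void publish(std::size_t pos, int value) {
        assert(pos < kCapacity);
        slots[pos] = value;
    }
};
```

Between `reserve` and `publish`, `size` already counts the slot while the slot still holds its initial contents, so a reader that trusts `size` alone could observe a position that was never written.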
In contrast to the design of the C++ Standard Library vector, which provides a guarantee that elements are stored contiguously in memory, these designs use a series of array segments, allowing for concurrent growth without having to move elements to increase capacity. The design of our operations is independent of the underlying memory model and can be used on either the contiguous or the segmented memory model. We examine the pros and cons of these two memory models in Section 5, and present a wait-free copy-over procedure to support the contiguous array model.
ALGORITHM OVERVIEW
In the following sections, we present our implementation of the vector tail operations (pushBack and popBack), random access operations (at and cwrite), and multi-position operations (insertAt and eraseAt). For each operation, we provide a description and an informal reasoning of correctness and wait-free progress.
The following algorithms use descriptor objects [9] to indicate that an operation is in progress. Descriptors are special objects in the vector which denote that an ongoing operation is affecting the element at the given position. A descriptor may be placed at a position where another logical value already exists. Each descriptor object we present contains one or more data members and a set of functions that include a complete and a value function. The complete function allows a thread to complete the operation that placed the descriptor object. Upon its return, it guarantees that the descriptor object has been removed. The value function allows a thread to determine the logical value of the address holding the descriptor object. If the descriptor object's operation is in progress then the value that the descriptor object replaced is returned; otherwise, the result of the operation is returned.
Our approach requires support for the atomic primitives compare-and-swap, fetch-and-add, and fetch-and-or (fao). These atomic operations are available in C++11 and on most architectures. When used to store machine-word sized objects, our design reserves the least significant bit (LSB) for type identification to distinguish between descriptors and elements. For brevity, this bit marking has been omitted from the pseudo code. In general, if a value is determined to be a reference to a descriptor object, then the bitmark is removed on the local copy before dereferencing it. If the contiguous memory model is used, an additional bit is required to support the concurrent resize procedure presented in Section 5. The constants we use to represent empty positions and uninitialized positions impose no additional restriction as they use the LSB. When used to store objects that are larger than a machine-word, our design requires the use of a multi-word compare-and-swap algorithm [10].
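The LSB marking can be sketched as follows. This is an illustrative assumption about one possible encoding, not the tested implementation: `Descriptor`, `Word`, and the helper names are hypothetical, descriptors are assumed to be at least 2-byte aligned, and element values are assumed to fit in the word after being shifted left by one bit.

```cpp
#include <cassert>
#include <cstdint>

struct Descriptor { int payload; };  // hypothetical stand-in for a descriptor

typedef std::uintptr_t Word;         // one machine word per vector position

// Tag a descriptor reference by setting the least significant bit.
inline Word markDescriptor(Descriptor* d) {
    Word w = reinterpret_cast<Word>(d);
    assert((w & 1u) == 0);           // relies on Descriptor alignment >= 2
    return w | 1u;
}

inline bool isDescriptor(Word w) { return (w & 1u) != 0; }

// Remove the bitmark on a local copy before dereferencing.
inline Descriptor* unmark(Word w) {
    return reinterpret_cast<Descriptor*>(w & ~static_cast<Word>(1));
}

// Assumed element encoding: shift left so the LSB stays clear.
inline Word encodeValue(std::uint32_t v) { return static_cast<Word>(v) << 1; }
inline std::uint32_t decodeValue(Word w) {
    return static_cast<std::uint32_t>(w >> 1);
}
```

A reader would first test `isDescriptor` on the loaded word and only call `unmark` on its local copy, leaving the shared word untouched.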
In the tested implementation, we incorporated a memory reclamation scheme based on hazard pointers [11] and reference counting [12] to determine if an object is safe to be reused. The specific details of this are not within the scope of this work.
Wait-Free Progress
A key challenge to achieving wait-free progress is that there is no guarantee that a thread will execute a successful cas operation or observe a value indicating that it should return. For example, a thread may update a value with such frequency that it causes another thread to always fail in its cas operation. To guard against this potential danger, we have implemented a progress assurance scheme based on the announcement table presented by Herlihy [13]. Each thread is required to check for an announcement before commencing its own operation. If a thread reads an announcement, it acts according to the contents of the announcement. We use the methodology described by Kogan [2] to reduce the cost of this check to $O(1)$. We use a novel descriptor-based recursive helping technique which allows threads to help complete operations which may span an arbitrary number of words across several memory addresses. As shown in Algorithm 9 Line 25, a thread makes an announcement when a fail threshold (LIMIT) has been reached. The announcement function (announceOp) internally calls the help complete function of the passed operation record. As a result, upon its return the operation is complete.
After making an announcement, only $(\mathit{threadCount} - 1)^2$ more operations can occur before the operation is complete. One way to execute an operation from an announcement table is to use the fast-path-slow-path methodology [2]. In this design, a thread examines the state of a shared object, checks the status of the operation record related to it, and executes the operation if it has not yet been completed. For simple operations, this avoids data races that could occur when multiple threads are executing the same operation. In more complex operations, such as those found in the vector, this procedure has several shortcomings. If it is uncertain which memory words will be affected by an operation, the same operation may be successfully executed on multiple memory locations, and values could be reused, leading to the ABA problem. An example of this occurring is shown in Section 6.
We present an association model that overcomes the limitations present in [2] . In this model, each operation record contains a control word which holds a reference to a descriptor object. This reference is initially NULL, and a cas operation is used to assign a value to it. Once the control word is assigned a value, the descriptor and the operation record are said to be associated. By design, it is not possible for the control word to change after it has been set. By using this association, it is possible to determine whether a descriptor object was placed correctly as part of an operation, or if it was placed in error due to thread delay. We use the association model in the construction of several of the vector's operations.
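The write-once behavior of the control word can be sketched as below. This is a minimal illustration under assumed names (`Descr`, `OpRecord`, `associate`), not the operation records of the tested implementation: the control word starts as NULL, and a single compare-and-swap decides which descriptor becomes associated.

```cpp
#include <atomic>
#include <cassert>

struct Descr { int id; };  // hypothetical descriptor object

struct OpRecord {
    std::atomic<Descr*> control;  // NULL until a descriptor is associated

    OpRecord() : control(nullptr) {}

    // Tries to associate `d` with this operation record. Returns true if
    // `d` is the associated descriptor, either because this call set it or
    // because an earlier call (possibly by a helper) already did.
    bool associate(Descr* d) {
        Descr* expected = nullptr;
        if (control.compare_exchange_strong(expected, d)) return true;
        // On failure, `expected` now holds the descriptor already stored.
        return expected == d;
    }
};
```

A thread whose descriptor fails to associate knows that its placement happened in error, for example because it was delayed, and can undo that placement.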
When a thread tries to access a vector element and instead reads a descriptor object, the thread will often need to perform a helping routine. While executing this helping routine, the thread may need to recursively complete the operations for other descriptors. We prevent indefinite recursive helping by requiring a thread to store a reference to its own operation's control word. Before helping to complete another thread's operation, it checks whether or not the value of its control word has changed to a non-NULL value. If so, some other thread has completed the operation, and the executing thread can return back to its own operation. A thread can also return back to its own operation if it reaches a recursive depth greater than the number of executing threads. This indicates that some operation it is currently helping has been completed.
For each vector operation, we describe the design of the operation record and any challenges faced while implementing it. For brevity, we omit the specific implementation of these operation records; however, the tested implementation of this algorithm is available upon request. The following algorithms are wait-free because they are composed of only wait-free functions and, using the progress assurance scheme, a limit can be placed on the number of times a thread can be prevented from making progress.
DATA STRUCTURE REPRESENTATION
The traditional in-memory representation of a vector is a single contiguous region of memory. When this region can no longer accommodate new elements, a new region must be allocated and the elements copied over. This design provides efficient random access operations and high data locality. However, the known concurrent vectors use a segmented memory model to store the vector's elements on a series of array segments. This model does not require elements to be copied over during a resize, but instead allows a thread to append a new array segment to a list of segments. Consequently, a thread must access this list of segments before it can access an element.
Known implementations of the array segment model depend on a static array of references to array segments.² They provide $O(1)$ access by requiring that the length of the first array segment be a power of two and the capacity of each subsequent array segment must be twice the capacity of the previous array.
This requirement allows a vector index to be mapped, in constant time, into a segment index and an offset into that segment. At a high level this is performed as follows: Let $2^Y$ be the length of the first array segment. To account for this length, the vector index is incremented by the length of the first segment ($2^Y$). The segment index is $X - Y$, where $X$ is the largest non-negative integer such that $2^X$ is at most the incremented vector index. The segment offset is equal to the incremented vector index minus $2^X$. In Algorithm 1 we take advantage of the binary representation of integers to quickly compute these two values using bitwise operations.
2. This static array imposes no significant restriction on the size of the vector, as a segment array of size 64 can accommodate at least $2^{64}$ elements.
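The index mapping described above can be sketched as follows. This is not the paper's Algorithm 1; `SegmentPos`, `highestBit`, and `locate` are hypothetical names. For portability the highest set bit is found with a loop, whereas a constant-time version would typically rely on a count-leading-zeros instruction.

```cpp
#include <cassert>
#include <cstdint>

struct SegmentPos {
    unsigned segment;      // index into the array of segments
    std::uint64_t offset;  // position within that segment
};

// Index of the highest set bit of a non-zero value, i.e. floor(log2(v)).
inline unsigned highestBit(std::uint64_t v) {
    assert(v != 0);
    unsigned bit = 0;
    while (v >>= 1) ++bit;
    return bit;
}

// Maps a vector index to (segment, offset), where the first segment has
// length 2^Y and every later segment doubles the previous capacity.
inline SegmentPos locate(std::uint64_t index, unsigned Y) {
    std::uint64_t pos = index + (std::uint64_t(1) << Y);  // incremented index
    unsigned X = highestBit(pos);                         // largest X with 2^X <= pos
    SegmentPos r;
    r.segment = X - Y;
    r.offset = pos ^ (std::uint64_t(1) << X);             // clears bit X: pos - 2^X
    return r;
}
```

With $Y = 3$, indices 0 through 7 map to segment 0, and indices 8 through 23 map to segment 1, whose capacity is 16.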
To support the contiguous element model, we include a new wait-free resize procedure. To our knowledge, there are no other known wait-free resize procedures for this contiguous model. The use of this resize procedure requires that a bit of each element be reserved. This allows a thread to distinguish between a valid element and one that has yet to be copied into the new internal Contiguous object. The presented algorithm focuses on increasing the vector's capacity and omits details related to decreasing the capacity. However, the presented methodology can also be applied to reduce the vector to a specific size.
A thread attempting to resize the vector allocates a new Contiguous object with the specified capacity and a reference to the old Contiguous object. All values in the new Contiguous object up to the capacity of the old object are initialized to a constant that indicates it has not been copied (NOTCOPIED). The remaining positions are initialized to a constant representing the position does not hold a valid value (NOTVALUE). Note the two least significant bits of these two constants are marked and they do not introduce any additional value restrictions. Next, the thread attempts to replace the reference to the old Contiguous object with a reference to the new one (Algorithm 3 Line 2). If this fails, it indicates that the vector has been resized by some other thread. In this event, the function returns the reference to the new Contiguous object. Otherwise, it attempts to copy each element from the old Contiguous object to the new one.
To copy an element from one Contiguous object to another, the thread uses an atomic or operation to set the resize bit to 1 (Algorithm 4 Line 3). This prevents any thread from changing the value, as any future cas operations will fail and any subsequent reads will observe the bitmark. Next, the thread replaces the NOTCOPIED value on the new Contiguous object with the value from the old object (Algorithm 4 Line 5). Any thread that sees a NOTCOPIED value would perform the same procedure before continuing its operation (Algorithm 5 Line 4). Similarly, if a thread reads a value with its resize bit set to 1, it will call the getSpot function to get the address of the element on the new vector. This allows any thread performing an operation during a resize to copy only the elements pertinent to its operation.
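The per-element copy step can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 4: the bit position of the resize mark, the encoding of NOTCOPIED, and the name `copyValue` are assumptions, and element values are assumed to leave the two least significant bits clear.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef std::uintptr_t Word;

const Word RESIZE_BIT = 0x2;  // assumed position of the resize bit
const Word NOTCOPIED  = 0x3;  // assumed encoding; both low bits are marked

// Copies one slot from the old array to the new one and returns the current
// contents of the new slot. Several threads may call this for the same slot:
// only the first compare-and-swap on `to` can succeed.
inline Word copyValue(std::atomic<Word>& from, std::atomic<Word>& to) {
    // Freeze the old slot. fetch_or returns the value held before the OR,
    // so a compare-and-swap expecting the unmarked value will now fail.
    Word old = from.fetch_or(RESIZE_BIT);
    Word value = old & ~RESIZE_BIT;

    // Install the value only if the new slot has not been copied yet.
    Word expected = NOTCOPIED;
    to.compare_exchange_strong(expected, value);
    return to.load();
}
```

A delayed helper that arrives after the new slot has been copied and then modified fails its compare-and-swap, so it cannot overwrite the newer value with the stale one.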
Depending on the use case of the vector, one representation may be more suitable than the other. For example, systems with limited memory resources may choose the contiguous model as it supports resizing by a specific amount. This is in contrast to the segmented model which must resize in a specific manner, which could allow for a significant amount of wasted space in the event a large expansion occurs when there are a small number of elements left to be added. For other use cases, the resize procedure provided by the contiguous array model could be too costly, and the memory utilization of the segmented model may not be a concern.
The presented vector operations are applicable for either memory model. In the description of the following algorithms we abstract the details of the underlying memory model using the getSpot method, which returns a specified position. Internally, getSpot calls the vector resize procedure if the position is beyond the vector's current capacity.
TAIL OPERATIONS
This section presents our implementation for the vector tail operations popBack and pushBack. Our design uses the techniques presented in [10] to maintain contiguous elements. We define the last element of the vector as any value (other than NOTVALUE) followed by NOTVALUE. We define the tail position as the position following this element. Because elements are contiguous, there can only be one of each.
We prevent data races between concurrent operations which modify the size of the vector by using modifications to the last element and the tail position as a natural synchronization point between operations. This ensures that concurrent operations do not break the contiguous element property by, for example, popping from the middle of the vector, or pushing to a position that is not the tail position. To this end, popBack must modify both of these locations.
pushBack must modify the tail position and re-examine the preceding index to ensure that the tail did not move after this modification. Fig. 1 demonstrates the operation of popBack. Each step in the figure represents the state of the vector after a single atomic operation. A popBack removing an element from position $n$ must change the values in positions $n$ and $n + 1$ to NOTVALUE. The NOTVALUE held at the tail position must be changed to prevent the operation from removing an element other than the last element.
popBack uses two types of descriptor objects: PopDescr and PopSubDescr. PopDescr consists solely of a reference to a PopSubDescr (child) which is initially NULL. PopSubDescr consists of a reference to a previously placed PopDescr (parent) and the value that was replaced by the PopSubDescr (value).
The first step in performing a popBack (shown in Fig. 1b) is to place a PopDescr at the tail position (Algorithm 6 Line 9). A correct placement effectively prevents the tail from moving during the remainder of the operation. Next, a PopSubDescr is constructed containing a reference to the placed PopDescr and the value of the last element in the vector. The last element of the vector is replaced with a reference to this PopSubDescr (Algorithm 7 Line 13; Fig. 1c). These two descriptors are associated by using a cas operation to set the child member of the PopDescr to the placed PopSubDescr (Algorithm 7 Line 14; Fig. 1d). Below we show how this association ensures correctness and prevents ABA. After associating the two descriptors, the popBack has linearized, and the two descriptors are replaced by NOTVALUE (Algorithm 7 Line 16 and Line 19; Figs. 1e and 1f). At this point, the operation has completed, and the popped value can be returned to the caller.
Fig. 2 demonstrates the operation of pushBack. A pushBack appending an element must replace the NOTVALUE at the tail position with the value being pushed. This operation uses a single descriptor, PushDescr, which contains the value to be pushed and a state member. This state member is initially UNDECIDED and can transition to either FAILED or PASSED. The need for this state is described in Section 6.1.
Similar to popBack, the first step in a pushBack is to place a PushDescr at the tail position of the vector (Algorithm 9 Line 14; Fig. 2b). To ensure that the PushDescr was placed at the tail, position $n - 1$ is examined. If it contains NOTVALUE, the PushDescr was placed incorrectly, and its state is set to FAILED. Thus, the operation must be tried again. Otherwise, the state member of PushDescr is set to PASSED. The pushBack is completed by replacing the PushDescr with the value being pushed.
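The state decision of a PushDescr can be sketched as below. This is a simplified illustration rather than the paper's Algorithm 10: the `decide` function and its boolean parameter are hypothetical, the inspection of the preceding position is reduced to that flag, and the special case of pushing into an empty vector is ignored.

```cpp
#include <atomic>
#include <cassert>

enum State { UNDECIDED, FAILED, PASSED };

struct PushDescr {
    int value;                // the value being pushed
    std::atomic<int> state;   // UNDECIDED until some thread decides it

    explicit PushDescr(int v) : value(v), state(UNDECIDED) {}

    // `prevIsNotValue` says whether the position before the descriptor was
    // observed to hold NOTVALUE. Only the first compare-and-swap can change
    // the state, so every helper ends up agreeing on the same outcome.
    State decide(bool prevIsNotValue) {
        int expected = UNDECIDED;
        int proposed = prevIsNotValue ? FAILED : PASSED;
        state.compare_exchange_strong(expected, proposed);
        return static_cast<State>(state.load());
    }
};
```

A thread that reads PASSED would go on to replace the descriptor with `value`, while FAILED signals that the descriptor must be removed and the push retried.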
We demonstrate how the use of association in popBack ensures correctness and prevents ABA. Say an operation executing concurrently with a popBack has read a PopDescr at position $n + 1$ and has determined that it must help complete it by calling PopDescr::complete.
Before calling the load in Algorithm 7, Line 6, the helping operation is delayed. While delayed, the popBack is completed, all descriptors are removed, and a pushBack occurs, placing $V_n$ at position $n$. When the helping operation resumes, it successfully places its PopSubDescr, despite the original operation having finished. The helping operation will then attempt to associate the newly placed PopSubDescr with the PopDescr. This association will fail because the PopDescr is associated with a different PopSubDescr. The helping operation will correct this error by replacing the PopSubDescr with the value it had replaced ($V_n$). This shows how ABA could occur with concurrent pushBack and popBack operations, and how our association prevents it from leading to incorrect behavior.
Correctness
We show that pushBack behaves as expected and the result is linearizable by examining the order of its operations. A value is only appended by replacing a PushDescr with that value. Since each descriptor object is used once and is placed in a single position, its internal value can only be appended once. After a PushDescr replaces NOTVALUE, the value to replace it with must be determined. This is done by examining the state member of the PushDescr. If it is unset, the thread will attempt to set it using cas. This guarantees that state will be set exactly once. The state is set to PASSED if the position before the PushDescr holds an element. In this case, it is impossible for that element to be removed before the pushBack operation associated with the PushDescr has been completed. This is because removing the element requires placing a descriptor object at the following position. However, that position already holds a PushDescr, which can only be removed after its operation has been completed. As such, it is impossible to incorrectly set the state to PASSED. This guarantees that if a thread replaces a PushDescr with a value, it did so while maintaining the contiguous element property of the vector. By examining the state of the PushDescr, a thread knows if its operation succeeded or failed and whether it needs to retry. This design guarantees that the value being appended will be added to the vector exactly once and that it will not break the contiguous element property.
The point at which a pushBack operation linearizes is after correctly placing a PushDescr (Algorithm 9 Line 14). However, this is not realized until the thread has determined that it was placed correctly (Algorithm 10 Line 3). Any thread that reads a PushDescr placed at the tail determines its logical value to be the value being pushed.
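The set-once behavior of the state member can be illustrated with a minimal sketch. It is not the paper's Algorithm 10: the enum values UNSET and FAILED, the function name resolve, and the boolean parameter standing in for "the position before the PushDescr holds an element" are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>

// Assumption: only PASSED is named in the text; UNSET and FAILED are
// hypothetical names for the remaining states.
enum class PushState { UNSET, PASSED, FAILED };

struct PushDescr {
    int value;  // value being appended
    std::atomic<PushState> state{PushState::UNSET};
    explicit PushDescr(int v) : value(v) {}
};

// Decide the outcome of a placed PushDescr. Any number of helping threads
// may call this, but only the first cas changes the state, so the outcome
// is fixed exactly once.
PushState resolve(PushDescr& d, bool prevHoldsElement) {
    PushState expected = PushState::UNSET;
    PushState desired =
        prevHoldsElement ? PushState::PASSED : PushState::FAILED;
    d.state.compare_exchange_strong(expected, desired);
    return d.state.load();  // every caller observes the same final state
}
```

A late helper that evaluates the neighboring position differently cannot overturn the decision, because its cas finds the state already set.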
We demonstrate the correctness of the popBack operation by showing that it removes a single element and that element was the last element in the vector. For the sake of contradiction, let us assume it is possible to remove an element that is not the last element. By the semantics of the presented algorithm, a value can only be removed by first replacing it with a PopSubDescr. If the assumption is true, it implies that the PopSubDescr was not placed before the PopDescr or that the PopDescr did not replace NOTVALUE. Either scenario violates the semantics of the presented algorithms, so our initial assumption must be incorrect, and the element removed must have been the last element. Now let us assume more than one value was removed during a single popBack operation. This implies that multiple threads read different non-NULL values when loading the child member of the PopDescr. This contradicts the semantics of the presented algorithms because child can only transition from NULL to non-NULL. As such, the value of a PopDescr's child member can only ever be one non-NULL value, so this assumption must also be incorrect.
The linearization point for popBack is the cas operation used to associate the PopDescr and PopSubDescr (Algorithm 7 Line 14). After this point, any thread reading the value of the address holding those descriptors will return NOTVALUE.
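The association step that both the ABA discussion and the linearization argument rely on can be sketched as follows. This is a simplified illustration, not Algorithm 7: the struct layouts and the function name associate are assumptions, and the removal of a losing PopSubDescr from the vector is omitted.

```cpp
#include <atomic>
#include <cassert>

struct PopDescr;

struct PopSubDescr {
    PopDescr* parent;  // the popBack operation this helper belongs to
    int value;         // the element value this helper replaced
};

struct PopDescr {
    // Transitions from NULL to one non-NULL value, and never changes again.
    std::atomic<PopSubDescr*> child{nullptr};
};

// Attempt to associate `sub` with `pop`. Returns true only if `sub` is the
// helper that the operation ended up associated with. A helper placed late
// (after the operation has finished) fails here and must be replaced with
// the value it removed.
bool associate(PopDescr& pop, PopSubDescr* sub) {
    PopSubDescr* expected = nullptr;
    pop.child.compare_exchange_strong(expected, sub);
    return pop.child.load() == sub;
}
```

Because child can only move from NULL to a single non-NULL reference, at most one PopSubDescr (and therefore one removed element) is ever tied to a given popBack.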
Wait-Freedom
This section describes how we apply the progress assurance scheme described in Section 4.1 to the popBack and pushBack operations. The while loops in Algorithms 6, 9, 7, and 10 typically terminate upon reading a value that indicates the operation should fail or performing a successful cas operation. In the rare event where descriptor objects are continually removed, a failure threshold causes the loop to be terminated and the operation to be completed by an operation record.
The operation record for a pushBack operation consists of a reference to a PushDescr. The operation record for a popBack operation consists of a reference to a PopDescr. Both references are initially NULL. These operations are completed in a similar manner as described above with the following differences. The PushDescr and PopDescr objects contain a reference to an operation record in execution. If a descriptor is placed correctly, a thread attempts to associate it with an operation record. Then, the descriptor is replaced with the result of its value function. If a descriptor is correctly associated, its value function returns the result of the operation. Otherwise, it returns the value originally replaced by the descriptor.
The while loops in Algorithms 6, 9, 7, and 10 terminate when the operation record is associated with a descriptor object. After an operation record has been placed in the announcement table, only a finite number of operations can occur before either the operation has been completed or all other threads are helping to complete the operation. If all threads are helping to complete the same operation, some thread will successfully place a descriptor object and associate it with the operation record. This allows us to derive an upper limit on the number of iterations that a thread will perform on the while loops. As such, these algorithms are wait-free because an upper limit can be derived for all loops contained within and they are composed of only wait-free functions.
Alternative Tail Operation Models
Unlike the lock-free vector, our pushBack and popBack model supports bounds checking, which prevents the case where a thread may incorrectly read the contents of a position greater than the current size. However, it does share a common deficiency with LFvec: the inability to diffuse or reduce the contention caused by supporting these opposing operations. It is our opinion that, unless an auxiliary structure (such as Herlihy's stack exchange [1]) is used or the vector's memory representation is significantly transformed, it is unlikely for both operations to co-exist and exhibit strong scaling in a many-core system.
To address this deficiency we present two alternative models of each operation that exhibit better scaling and performance. These alternatives are designed to be executed in isolation from their opposing tail operation. We believe that a number of algorithms which use vectors can take advantage of these alternative implementations during phases where elements are solely being pushed or popped from the vector.
Compare-and-Swap Tail Operations
Algorithms 11 (cas_popBack) and 12 (cas_pushBack) present an implementation of popBack and pushBack in which other threads can safely access elements while elements are being pushed or popped from the vector. In contrast to the vector in Intel's Threading Building Blocks (TBBvec), our design guarantees that if a thread reads the value at a position, either a valid value is returned, or false is returned indicating it is out of bounds.
The cas_pushBack operation loops until it replaces NOTVALUE with the value being pushed; after which it increments the size variable. The cas_popBack operation loops until it replaces a value with NOTVALUE; after which it decrements the size variable. If the size variable is less than zero, it returns false. As discussed in Section 4.1, the number of times a thread retries an operation is limited by our progress assurance scheme.
The linearization point of these operations is the successful execution of the cas operation. We sketch an informal proof of correctness by examining how the vector is modified. By the nature of the operations executing, the vector's tail can move in a single direction. As such, we can reason that if we observe an element followed by NOTVALUE, and we succeed at applying the operation, it is impossible for our operation to break the contiguous element property. If we assume a thread incorrectly removed an element that was not at the tail, then this implies that the tail moved as a result of the addition or removal of one or more elements. However, this is in contradiction to the restriction we placed on the types of operations allowed to execute concurrently.
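A minimal sketch of the compare-and-swap tail operations is given below. It follows the description above rather than Algorithms 11 and 12 themselves, and makes several assumptions: elements are long values, -1 is reserved as NOTVALUE, the capacity is fixed (no resize), and the progress assurance scheme of Section 4.1 is omitted. The class name CasTailVec is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: -1 is reserved as the NOTVALUE sentinel and is never stored
// as an element.
constexpr long NOTVALUE = -1;

class CasTailVec {
public:
    explicit CasTailVec(std::size_t capacity) : slots_(capacity), size_(0) {
        for (auto& s : slots_) s.store(NOTVALUE);
    }

    // Intended for phases without concurrent pops.
    bool cas_pushBack(long v) {
        long cap = static_cast<long>(slots_.size());
        for (long pos = size_.load(); pos < cap; ++pos) {
            long expected = NOTVALUE;
            if (slots_[pos].compare_exchange_strong(expected, v)) {
                size_.fetch_add(1);  // size is updated after the slot
                return true;
            }
            // Another pusher filled this slot first; try the next one.
        }
        return false;  // capacity exhausted (resize omitted in this sketch)
    }

    // Intended for phases without concurrent pushes.
    bool cas_popBack(long& out) {
        for (long pos = size_.load() - 1; pos >= 0; --pos) {
            long cur = slots_[pos].load();
            if (cur != NOTVALUE &&
                slots_[pos].compare_exchange_strong(cur, NOTVALUE)) {
                size_.fetch_sub(1);
                out = cur;
                return true;
            }
            // Another popper emptied this slot first; try the previous one.
        }
        return false;  // vector is empty
    }

    long size() const { return size_.load(); }

private:
    std::vector<std::atomic<long>> slots_;
    std::atomic<long> size_;
};
```

Because the tail moves in only one direction within a phase, a failed cas tells the thread exactly which way to look for the new tail.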
Fetch-and-Add Tail Operations
Algorithms 13 (faa_pushBack) and 14 (faa_popBack) present an implementation of pushBack and popBack in which a thread is assigned a position to complete its operation. This assignment is accomplished by performing a fetch-and-add on the size variable which adjusts the size monotonically. This allows the algorithms to be implemented without loops or descriptor objects. Additionally, due to the short amount of time a thread accesses the size variable, it is unlikely that it becomes a bottleneck. This approach mirrors that used in TBBvec to append new elements. However, in contrast to TBBvec, our random access operations can be designed to detect if a position has been assigned but not fully initialized (See Section 7).
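The position-assignment idea can be sketched as follows. This is an illustration under stated assumptions, not Algorithms 13 and 14: elements are long values, -1 stands in for NOTVALUE, the capacity is fixed, and a push that runs past the capacity simply undoes its increment and fails, where a full implementation would resize. The class name FaaTailVec is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr long NOTVALUE = -1;  // assumption: reserved sentinel

class FaaTailVec {
public:
    explicit FaaTailVec(std::size_t capacity) : slots_(capacity), size_(0) {
        for (auto& s : slots_) s.store(NOTVALUE);
    }

    // Intended for push-only phases: one fetch-and-add, no loop.
    bool faa_pushBack(long v) {
        long pos = size_.fetch_add(1);  // claim a unique position
        if (pos >= static_cast<long>(slots_.size())) {
            size_.fetch_sub(1);  // sketch only: undo instead of resizing
            return false;
        }
        slots_[pos].store(v);  // the position is ours; publish the value
        return true;
    }

    // Intended for pop-only phases, after all pushes have completed.
    bool faa_popBack(long& out) {
        long pos = size_.fetch_sub(1) - 1;  // claim the last position
        if (pos < 0) {
            size_.fetch_add(1);  // nothing to pop; restore size
            return false;
        }
        out = slots_[pos].exchange(NOTVALUE);
        return true;
    }

    long size() const { return size_.load(); }

private:
    std::vector<std::atomic<long>> slots_;
    std::atomic<long> size_;
};
```

Each thread touches the shared size variable once per operation and then works on a position no other thread holds, which is why contention on size stays low.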
RANDOM ACCESS OPERATIONS
This section presents our implementation for the random access read (at) and modification (cwrite) operations. These implementations are bounds checked on both the capacity of the vector and the tail of the vector. Any attempt to access a position that is not within the bounds of the vector causes the function to return false. To identify whether or not a position is within bounds, a thread first checks if the position is less than the capacity. Then it checks whether or not the logical value of a position is NOTVALUE. If it is NOTVALUE, that position is considered out of bounds.
At
Algorithm 15 presents our implementation of the vector's at operation. Depending on the types of operations currently executing, a thread may attempt to access a position that holds a descriptor object. We handle this by calling the value function, which returns a value based on the state of the operation that placed the descriptor. If the operation is in progress or the descriptor was placed in error, the value of the address before the descriptor was placed is returned. Otherwise, the result of the operation is returned. The exact implementation of this function is specific to each descriptor object. Depending on the complexity of the operation, some amount of helping could be performed.
The at algorithm is wait-free because it contains no loops and it only calls other wait-free functions. It linearizes, with respect to all other operations, on the loading of the value at the specified position (Algorithm 15 Line 2), unless the value is a descriptor. If it is a descriptor, it linearizes upon determining its logical value (Algorithm 15 Line 4).
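The two bounds checks and the descriptor handling can be sketched as below. This is not Algorithm 15. It assumes a common encoding in which each slot is a machine word, 0 is reserved as NOTVALUE, element words have their low bit clear, and a set low bit tags a pointer to a descriptor. The Descriptor base class and the PendingDescr example type are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uintptr_t NOTVALUE = 0;   // assumption: 0 is reserved
constexpr std::uintptr_t DESCR_BIT = 1;  // assumption: low bit tags a descriptor

struct Descriptor {
    virtual ~Descriptor() = default;
    // Logical value of the slot that currently holds this descriptor.
    virtual std::uintptr_t value() const = 0;
};

// Hypothetical descriptor for an operation still in progress: its logical
// value is the value that was in the slot before it was placed.
struct PendingDescr : Descriptor {
    std::uintptr_t replaced;
    explicit PendingDescr(std::uintptr_t r) : replaced(r) {}
    std::uintptr_t value() const override { return replaced; }
};

inline bool isDescr(std::uintptr_t w) { return (w & DESCR_BIT) != 0; }
inline std::uintptr_t tag(Descriptor* d) {
    return reinterpret_cast<std::uintptr_t>(d) | DESCR_BIT;
}
inline Descriptor* untag(std::uintptr_t w) {
    return reinterpret_cast<Descriptor*>(w & ~DESCR_BIT);
}

// Bounds-checked read: no loops, one load, at most one value() call.
bool at(const std::vector<std::atomic<std::uintptr_t>>& slots,
        std::size_t pos, std::uintptr_t& out) {
    if (pos >= slots.size()) return false;  // beyond the capacity
    std::uintptr_t w = slots[pos].load();
    if (isDescr(w)) w = untag(w)->value();  // resolve to the logical value
    if (w == NOTVALUE) return false;        // beyond the tail
    out = w;
    return true;
}
```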
Conditional Write
Algorithm 16 presents a wait-free implementation of a vector-specific cas operation (cwrite). A cwrite operation is used to conditionally update the value at a specified position if it matches an expected value. The use of descriptor objects can cause a cas operation to fail, even if the logical value of the address matches the expected value. Our implementation overcomes this by detecting if the read value is a descriptor object, and, if so, it removes that descriptor object. When a non-descriptor object is read, the thread compares the expected value with the read value. If they match, the thread attempts to replace the read value with a new value. If this is successful, the thread returns TRUE (Algorithm 16 Line 4). Otherwise, the value at the specified position is re-examined. If the read value does not match the expected value, the operation linearizes on the load and FALSE is returned (Algorithm 16 Line 8).
To achieve wait-free behavior, we use an operation record (WriteOp) and a descriptor object (WriteHelper). A WriteOp holds the position to operate on, the expected value (old), the new value (new), and an atomic reference to a WriteHelper (child). A WriteHelper holds a reference to a WriteOp (parent) and the value it replaced (value). A thread replaces the value at the specified position with a reference to a WriteHelper and attempts to associate it with a WriteOp, after which the WriteHelper is replaced with its logical value. The logical value is determined as follows:
If it is associated with the WriteOp and its value member matches old, its logical value is new. Otherwise, its logical value is its value member. The cwrite operation linearizes, with respect to all other operations, on the successful cas operation (Algorithm 16 Line 4), unless the operation record was used. If so, it linearizes when a thread associates the WriteOp with a WriteHelper.
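The logical-value rule for a WriteHelper translates almost directly into code. The sketch below uses the member names given above, except that old and new are spelled old_value and new_value because new is a C++ keyword. The function name logicalValue is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct WriteHelper;

struct WriteOp {
    std::size_t pos;                          // position to operate on
    long old_value;                           // expected value ("old")
    long new_value;                           // replacement value ("new")
    std::atomic<WriteHelper*> child{nullptr}; // associated helper, if any
    WriteOp(std::size_t p, long o, long n)
        : pos(p), old_value(o), new_value(n) {}
};

struct WriteHelper {
    WriteOp* parent;  // operation record this helper was placed for
    long value;       // value the helper replaced in the vector
};

// Logical value of a position that currently holds `h`.
long logicalValue(const WriteHelper& h) {
    bool associated = (h.parent->child.load() == &h);
    if (associated && h.value == h.parent->old_value) {
        return h.parent->new_value;  // the conditional write succeeded
    }
    return h.value;  // not associated, or expected value did not match
}
```

A helper that loses the association race, or that replaced a value different from old, resolves to the value it displaced, so the write takes effect at most once.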
Interaction with Fetch-and-Add Tail Operations
Using these random access operations in conjunction with the fetch-and-add based tail operations may lead to unexpected behavior. For example, if a thread performing a faa_pushBack has been assigned a position, i, at which to place its value, but has not written its value yet, then this position would be considered out of bounds. If subsequent faa_pushBack operations assign values to positions greater than i, i will still be considered out of bounds until it has a value written to it. It is possible to modify the at and cwrite operations to include additional logic to detect such cases. However, it is not within the scope of this work to propose what actions should be taken in this scenario.
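The scenario above can be reproduced deterministically by splitting a fetch-and-add push into its two steps. In this sketch, reserve and publish are hypothetical names for "obtain a position" and "write the value", -1 stands in for NOTVALUE, and the capacity is fixed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr long NOTVALUE = -1;  // assumption: reserved sentinel

struct FaaVec {
    std::vector<std::atomic<long>> slots;
    std::atomic<long> size{0};

    explicit FaaVec(std::size_t capacity) : slots(capacity) {
        for (auto& s : slots) s.store(NOTVALUE);
    }

    // Step 1 of a fetch-and-add push: obtain a position.
    long reserve() { return size.fetch_add(1); }

    // Step 2: write the value into the assigned position.
    void publish(long pos, long v) { slots[pos].store(v); }

    // Bounds-checked read: a position holding NOTVALUE is out of bounds.
    bool at(long pos, long& out) const {
        if (pos < 0 || pos >= static_cast<long>(slots.size())) return false;
        long w = slots[pos].load();
        if (w == NOTVALUE) return false;
        out = w;
        return true;
    }
};
```

If one thread reserves position i and stalls while another reserves and publishes position i+1, at(i) reports out of bounds even though at(i+1) succeeds, until the first thread publishes its value.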
MULTI-POSITION OPERATIONS
This section provides an overview of how we implement wait-free linearizable multi-position operations, such as the shift operations insertAt and eraseAt, and the map operation. We are aware of no other non-blocking or fine-grained locking vectors that support such operations. In a sequential vector, the cost of performing a shift operation is very high, and in a non-blocking vector it is even greater. This functionality is provided for the sake of completeness of the vector's API.
The design of our multi-position operations depends upon the bi-directional association model described in Section 6. For each multi-position operation, we define an operation record that describes the action to be performed and a helper descriptor object (MPdescr) that is placed during the execution of the operation. In the majority of operations, these helper objects are used to construct a doubly-linked list, where each item in the list corresponds to a position in the vector that is being modified by the operation.
When designing operations that place MPdescr objects, it is important to consider the order in which such objects are placed. If care is not taken, two or more concurrent operations could deadlock in the event a cyclic dependency exists between them. To prevent this, we make the design restriction that multi-position operations must place descriptors in ascending order of index. This ensures that two concurrent multi-position operations cannot have a cyclic dependency. We choose ascending order, as opposed to descending order, to prevent a lengthy operation from significantly delaying a tail operation. If descending order were chosen, then the entire multi-position operation would have to be completed before any further tail operations could begin.
Complications arise, however, because this choice of placement order is the opposite of that in which pushBack and popBack place descriptors. Consider the case where an insertAt operation is placing descriptor objects in ascending order of index until it has replaced a NOTVALUE. Concurrently, a pushBack operation is occurring at the tail. When the insertAt operation encounters the pushBack operation's descriptor, it calls that descriptor's complete function. The pushBack descriptor's complete function must read the logical value at the previous position. This will be a descriptor for the insertAt operation, and to get its logical value, its complete function must be called. This complete function must, again, read the same pushBack descriptor and call its complete function. This cyclic dependency between descriptor placement orders thus results in unbounded recursion.
This danger is avoided by including type checking in the complete function of a multi-position operation's descriptor. If it encounters a descriptor object known to have a cyclic dependency, it will perform special logic to resolve the dependency before calling complete. If the descriptor is a PushDescr (as in the above example), its state is set to PASSED. If the descriptor is a PopDescr, its child member is set to FAILED with a cas operation.
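The type check described above might look like the following sketch. It is an illustration only: the Descr base class, the enum values other than PASSED, the FAILED_SENTINEL object standing in for FAILED, and the function name resolveDependency are all assumptions. The use of cas for the PushDescr case is also an assumption, made to keep the state set-once.

```cpp
#include <atomic>
#include <cassert>

enum class PushState { UNSET, PASSED, FAILED };

struct PopSubDescr {};
// Hypothetical stand-in for the FAILED marker stored in a PopDescr's child.
static PopSubDescr FAILED_SENTINEL;

struct Descr {
    virtual ~Descr() = default;
};

struct PushDescr : Descr {
    std::atomic<PushState> state{PushState::UNSET};
};

struct PopDescr : Descr {
    std::atomic<PopSubDescr*> child{nullptr};
};

// Called by a multi-position descriptor's complete function before it calls
// complete on a descriptor it has encountered. Fixing the outcome of a tail
// operation here means its complete function no longer needs to read the
// previous position, which breaks the recursion.
void resolveDependency(Descr* d) {
    if (auto* push = dynamic_cast<PushDescr*>(d)) {
        PushState expected = PushState::UNSET;
        push->state.compare_exchange_strong(expected, PushState::PASSED);
    } else if (auto* pop = dynamic_cast<PopDescr*>(d)) {
        PopSubDescr* expected = nullptr;
        pop->child.compare_exchange_strong(expected, &FAILED_SENTINEL);
    }
    // Other descriptor types have no cyclic dependency and need no fix-up.
}
```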
Algorithms 17 and 18 present an example of how we constructed the shift operations insertAt and eraseAt. The two operations differ in only two aspects: how they modify the size variable (Line 10), and how they determine the logical value of the helper objects used in completing the operation (Line 1). These differences come from the task being performed (i.e., inserting versus removing an element).
ShiftOp is a multi-position operation record, and ShiftDescr is the type of the helper objects placed by it. The shift operation is executed by iteratively replacing values on the vector with references to ShiftDescrs until it has replaced a NOTVALUE. Next, it iteratively replaces the ShiftDescrs with their logical values (Algorithm 20). Their logical values are determined by the ValueGetter function, which is defined in Algorithms 17 and 18.
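One plausible reading of how a ShiftDescr's logical value follows from its neighbors in the linked list is sketched below. The ValueGetter definitions are not reproduced in this section, so this is an assumption based on ordinary shift semantics: after an insert, each position takes the value its predecessor used to hold, and after an erase, each position takes the value its successor used to hold. The member and function names are hypothetical.

```cpp
#include <cassert>

constexpr long NOTVALUE = -1;  // assumption: reserved sentinel

struct ShiftDescr {
    long replaced;     // value this descriptor replaced in the vector
    ShiftDescr* prev;  // descriptor at the previous index, if any
    ShiftDescr* next;  // descriptor at the next index, if any
    explicit ShiftDescr(long r) : replaced(r), prev(nullptr), next(nullptr) {}
};

inline void link(ShiftDescr& a, ShiftDescr& b) {
    a.next = &b;
    b.prev = &a;
}

// insertAt: the first position receives the inserted value; every later
// position receives the value displaced from the position before it.
long insertValue(const ShiftDescr& d, long inserted) {
    return d.prev ? d.prev->replaced : inserted;
}

// eraseAt: every position receives the value displaced from the position
// after it; the last descriptor (which replaced NOTVALUE) stays NOTVALUE.
long eraseValue(const ShiftDescr& d) {
    return d.next ? d.next->replaced : NOTVALUE;
}
```

For a vector holding 1, 2, 3, the chain of descriptors records 1, 2, 3, NOTVALUE. Inserting 9 at index 0 yields 9, 1, 2, 3, and erasing index 0 yields 2, 3, NOTVALUE, NOTVALUE.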
An element cannot be inserted into or removed from a position that is beyond the size of the vector. To prevent a thread from attempting this, Algorithm 19 contains specialized logic when placing the first ShiftDescr. To place the first ShiftDescr, the value of the specified position is examined:
If the value is NOTVALUE, the ShiftOp's next member is set to FAILED using a cas operation (Algorithm 19 Line 14).
If the value is a descriptor type, the descriptor's complete function is called (Algorithm 19 Line 12). Otherwise, the value is a Value type and a cas operation attempts to place a ShiftDescr (Algorithm 19 Line 17).
If successful at replacing the value, a subsequent cas operation attempts to associate the ShiftDescr with the ShiftOp (Algorithm 19 Line 18). As in previous operations, if the association fails, the ShiftDescr was placed after the operation has already been completed.
The remainder of the ShiftDescrs are placed in a similar manner, with the following differences:
The placed ShiftDescr is associated with the most recently placed ShiftDescr instead of the ShiftOp (Algorithm 19 Line 44). If the value to be replaced is a PushDescr or PopDescr, the aforementioned dependency resolution logic is used before calling its complete function. The iteration terminates when a ShiftDescr has replaced NOTVALUE (Algorithm 19 Line 27).
Correctness
The correctness of these algorithms can be demonstrated using the same logic used in Section 6.1 and in [10]. For brevity, we provide only a short overview of the correctness of these algorithms.
The insertAt and eraseAt operations replace the value held at indices greater than or equal to the specified position. The semantics of Algorithm 19 ensure that the values at those indices are each replaced by a ShiftDescr. The linearization point of a shift operation with respect to another operation is the placement of a descriptor object at the lowest common index. The operation that places the descriptor first will occur first. The other operation will either see a descriptor object at the index or observe the effects of the first operation. If it sees a descriptor object, it will help complete the operation that placed it. The association model ensures that if multiple threads are placing ShiftDescr objects, the effects of the shift operation will not occur twice.
The insertAt and eraseAt operations update the value held at indices greater than or equal to the specified position with correct values. The ShiftDescrs prevent the value at each index from changing during the operation. Any write operation that attempts to modify the value must first help complete the shift operation. The doubly-linked list created by the ShiftDescrs provides access to the information necessary to determine the logical value of each ShiftDescr. This allows any operation to replace a ShiftDescr with its correct logical value.
Wait-Freedom
This algorithm uses the announcement table described in Section 4.1 to achieve wait-freedom in the event a thread is unable to place a ShiftDescr in a reasonable amount of time. This places a limit on the number of times a thread executes the three while loops in Algorithm 19. By design, an operation is complete once the announceOp function returns. An announced operation is helped by calling the helpComplete function. This function differs from Algorithm 19 in that it does not have a fail counter. As described in Section 4.1, there exists an upper bound on the number of operations that may occur before all threads are helping to execute the announced operation. When this upper bound is reached, the only vector modification that can occur is the placement of ShiftDescrs for the announced operation. This implies:
If the load at Algorithm 19 Line 10 or Line 34 returns a reference to a descriptor object, then it must be a ShiftDescr placed for the announced operation. If the cas at Algorithm 19 Line 17 or Line 43 fails, it must be a result of another thread placing a ShiftDescr for the announced operation. In either case, this implies that the operation is making progress towards completion. Because there is a finite number of values to be replaced and a finite number of operations that may occur before a value is replaced, it is guaranteed that a value will be replaced in a finite number of steps. Thus, this algorithm is wait-free.
EXPERIMENTAL EVALUATION
In this section we compare the performance of the presented wait-free vector to that of TBB's vector (TBBvec) [4] and the lock-free vector [5]. Because TBB's vector does not support popBack, we compare it with our wait-free algorithm's alternative tail operations: fetch-and-add (WFfaa) and compare-and-swap (WFcas). These, like TBB, do not incur the additional performance penalty related to supporting concurrent popBack operations. We examine how the different designs are affected by thread contention, complexity, and the types of operations executing.
Testing Methodology
Our testing procedure consists of a main thread that initializes the vector with an initial capacity of one million and pre-fills it with 10,000 elements. Then it spawns a set of worker threads and, when all threads are ready, it signals the start of execution and sleeps for 5 seconds. Upon waking, it signals the end of the execution and determines the number of operations per second performed. Each worker thread repeatedly executes vector operations according to a specified distribution until it has received a signal to stop.
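The measurement loop described above can be sketched with standard C++11 threading facilities. This is an illustrative harness, not the authors' benchmark code: the function name opsPerSecond and its parameters are assumptions, the vector setup and the operation distribution are left to the caller-supplied op, and the run length is a parameter rather than a fixed 5 seconds.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Runs `op` repeatedly on `workers` threads for roughly `duration` and
// returns the aggregate number of operations per second.
double opsPerSecond(unsigned workers, std::chrono::milliseconds duration,
                    const std::function<void()>& op) {
    std::atomic<unsigned> ready{0};
    std::atomic<bool> start{false};
    std::atomic<bool> stop{false};
    std::vector<std::uint64_t> counts(workers, 0);
    std::vector<std::thread> pool;

    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&, i] {
            ready.fetch_add(1);  // report that this worker is ready
            while (!start.load()) std::this_thread::yield();
            std::uint64_t local = 0;
            while (!stop.load()) {  // run until signaled to stop
                op();
                ++local;
            }
            counts[i] = local;  // read by the main thread after join
        });
    }

    while (ready.load() < workers) std::this_thread::yield();
    auto t0 = std::chrono::steady_clock::now();
    start.store(true);                      // signal the start of execution
    std::this_thread::sleep_for(duration);  // main thread sleeps
    stop.store(true);                       // signal the end of execution
    auto t1 = std::chrono::steady_clock::now();
    for (auto& t : pool) t.join();

    std::uint64_t total = 0;
    for (auto c : counts) total += c;
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(total) / seconds;
}
```

Each worker keeps a private counter and writes it out once, so the measurement itself adds no shared-memory traffic to the operations being timed.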
All tests were conducted on a 64-core ThinkMate RAX QS5-4410 server running Ubuntu 12.04 LTS. It is a NUMA system with four AMD Opteron 6272 CPUs (16 cores per chip @2.1 GHz) and 314 GB of shared memory. For each algorithm tested, a separate executable was compiled with GCC 4.8 and the options -std=c++11 and -O3. The following performance results are the average performance over 10 executions.
Fig. 3 presents the performance of each algorithm when using different ratios of random access read operations to pushBack operations. As shown in graph 3a, both TBBvec and WFfaa outperform other implementations when there are solely pushBack operations executing on the vector. As the number of threads increases beyond the number of cores in each processor, the WFfaa version starts to outperform TBBvec. This test is similar to the phase of an application where values are being accumulated for later processing. On average, WFfaa performs 1.16 times as many operations per second as TBBvec, and 13.38 times as many as LFvec. Compared to WFcas and WFvec, WFfaa performs 2.3 and 22.12 times as many operations per second, respectively.
Analysis
Both TBBvec and WFfaa scaled similarly well, while WFvec and LFvec scaled poorly. We attribute this poor scaling to the cost of supporting conflicting operations, which causes both algorithms to incur significant synchronization penalties during concurrent operations. WFcas performs better than WFvec due to its simpler design and smaller critical section. However, its design does not distribute the contention like that of the faa-based approaches, which reduces its scalability.
Fig. 3b shows the performance of the vectors when only 10 percent of operations are pushBack and the majority of operations are random access reads. This test is similar to the phase of an application where most threads are actively processing the accumulated elements in the vector. In this graph, each implementation exhibits a similar scaling pattern of increasing in performance up until eight to 16 threads, after which they lose performance. WFvec's and LFvec's pushBack begin to lose performance sooner, while the faa-based approaches maintain scalability for longer. This loss in performance is explained by the fact that each processor on the system has 16 cores. Testing on a system which supports a higher number of cores on a single processor may produce different scaling patterns.
Fig. 3c examines the performance when the number of threads is held constant at 64 and the ratio of read operations to pushBack operations is varied. We see that the performance of each algorithm scales similarly, with WFfaa performing the best, followed by TBBvec, and then WFcas. The performance of each implementation significantly improves as the number of pushBack operations decreases. This is expected, as the cost of reading a value is less than that of appending an element.
Fig. 3d presents the case where only read operations are executing on the vector. In this case, WFvec performs on average 2.66 times as many operations as TBB's vector and 9.28 times as many as LFvec.
This was unexpected, since all three models are using the same memory layout and should exhibit similar access times. After analysis, we attribute this difference in performance to the implementation of our approach. In our implementation, when a variable's value changes only once, we use loads with relaxed memory consistency semantics to load its value. If the read value is the initial value, an atomic load is used to ensure correctness. This approach significantly decreases the cost of loading references to the vector's internal array.
Fig. 4 presents the performance of the vectors when there are interleaved pushBack, popBack, and read operations. TBBvec, WFfaa, and WFcas are not included in this test as they do not support concurrent pushBack and popBack. In this scenario, on average, LFvec outperforms WFvec by a factor of 1.46, with both approaches scaling poorly as the number of threads increases. However, at low thread counts or when there is a high ratio of read to tail operations, WFvec performs better than LFvec. This is because WFvec has better support for read-through parallelism, and because, under low contention, tail operations have a higher degree of parallelism as compared to LFvec. Under high contention, the cost of the helping scheme used by WFvec overshadows this, resulting in better performance for LFvec. The methodology used by LFvec to support both operations has significant safety concerns (Section 3), and we believe that the marginal performance benefit at high thread counts does not justify the risk of such a design.
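The relaxed-load optimization credited for the read-only results can be sketched as follows. This is an interpretation of the description, not the authors' code: the function name loadSetOnce is hypothetical, NULL is assumed to be the initial value, and the stronger load is assumed to be a sequentially consistent one. A strictly portable version would also need acquire ordering before dereferencing a pointer obtained by the relaxed load; the sketch leaves that aside.

```cpp
#include <atomic>
#include <cassert>

// Load a reference that changes at most once, from nullptr to its final
// value. The cheap relaxed load is tried first; only if it still shows the
// initial value is a sequentially consistent load performed.
template <typename T>
T* loadSetOnce(const std::atomic<T*>& ref) {
    T* p = ref.load(std::memory_order_relaxed);
    if (p == nullptr) {
        // Might be stale: fall back to the stronger load.
        p = ref.load(std::memory_order_seq_cst);
    }
    return p;
}
```

Once the reference has been set, every later read takes only the relaxed path, which is where the savings on frequently read fields would come from.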
We are aware of no other concurrent vector implementations that support insertAt or eraseAt operations. Because of this, we compare the performance of our approach to that of the C++ Standard Library vector (STLvec) protected by a global lock. Fig. 5 presents a graph comparing the performance of WFvec and STLvec shift operations. It is no surprise that WFvec performs worse than STLvec by a factor of 5.28. In contrast to the sequential implementation of these algorithms, non-blocking implementations must provide a linearizable history of execution. In our presented implementation, concurrent eraseAt and insertAt operations become serialized at the highest shared position. Whichever shift operation acquires the highest shared position first will be completed before the other, effectively blocking the other shift operation. In contrast, contention on the global lock which protects the STLvec is inconsequential when considering the cost of shifting these elements. We include this implementation for two practical reasons. First, insertAt and eraseAt operations are generally used infrequently, and supporting them does not have a negative effect on the complexity of other operations. Second, as we discussed in Section 8, the methodology used to construct these operations can be applied to other multi-position operations such as the map operation.
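A baseline of the kind used for STLvec can be sketched as a std::vector guarded by a single mutex. The class name LockedVec and its method set are assumptions. The bounds rules here (insertion allowed at any index up to and including the size) are also a choice made for this sketch and may differ from the benchmarked code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// A sequential vector made thread-safe with one global lock: every
// operation, including reads, is serialized.
template <typename T>
class LockedVec {
public:
    bool insertAt(std::size_t pos, const T& val) {
        std::lock_guard<std::mutex> guard(lock_);
        if (pos > data_.size()) return false;
        data_.insert(data_.begin() + pos, val);
        return true;
    }

    bool eraseAt(std::size_t pos) {
        std::lock_guard<std::mutex> guard(lock_);
        if (pos >= data_.size()) return false;
        data_.erase(data_.begin() + pos);
        return true;
    }

    void pushBack(const T& val) {
        std::lock_guard<std::mutex> guard(lock_);
        data_.push_back(val);
    }

    bool at(std::size_t pos, T& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (pos >= data_.size()) return false;
        out = data_[pos];
        return true;
    }

private:
    std::mutex lock_;
    std::vector<T> data_;
};
```

Under the lock a shift is a plain sequential copy, which is consistent with the baseline winning on shift throughput while reads and tail operations queue behind any shift in progress.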
If we compare the performance of these vectors using thread groups (where threads are grouped by the type of operations they execute), we see a drastically different performance picture. Fig. 6 presents the results from a test where a main thread spawns three thread pools. Each of these executes a different type of operation: shift, tail, or random-access. The left side of this table indicates the number of threads assigned to perform a specified type of operation. The right side displays the factor by which the number of operations increases (+) or decreases (-) when using WFvec as compared to STLvec. Using this table, we see that STLvec performs on average 4.13 times as many shift operations as WFvec; however, WFvec performs on average 404,957.79 and 281.52 times as many tail and random-access operations, respectively. This significant difference in performance is attributed to the ability of our vector to execute operations concurrently, in contrast to the STLvec, which is not thread-safe and must use mutual exclusion to protect the data structure, effectively removing any parallelism. Additionally, our random-access operations are able to identify the logical value of a position that holds a descriptor object. This allows them to be minimally affected by costly shift operations.
CONCLUSION
We presented a new concurrent vector that provides improvements over existing designs. These improvements include: providing a stronger guarantee of progress (wait-freedom), stronger safety properties (ABA-freedom and bounds checking), and support for more operations (insertAt and eraseAt). We developed a technique which uses association between operation records and descriptor objects to achieve wait-free progress when operations may access a variable number of words in memory. Our shift operations provide a technique for the development of other multi-position vector operations, empowering developers to operate on the vector in a more expressive manner.
We created a new resize procedure which allows elements to be contiguous in memory. This procedure facilitates concurrent access to the vector without requiring threads to help complete the entire resize. Instead, it only requires a thread to copy elements pertinent to its operation. The design of our vector operations is not limited to the contiguous model, but can also be implemented using the segmented memory model.
We compared the presented vector with other known approaches using a series of micro-benchmarks and found the presented approach to be most performant in the majority of cases. In comparison to the only existing non-blocking vector, our new design performed on average 4.97 times more operations per second. When simulating the filling of a vector, it performs 13.38 times more.
We identified how supporting opposing operations has an adverse effect on both the complexity and performance of vector operations. We then presented a set of memory-layout-compatible alternative function models for tail operations which allow the developer to overcome this performance issue by leveraging the natural phases of execution in an application. Our analysis revealed that the throughput of the pushBack operation was increased by a factor of 22 as a result of applying this new model.
Our future goals include identifying data structures which we can simplify by applying our association model to provide wait-free progress and identifying structures where separating functionality into function models with relaxed semantics can improve performance.
Steven Feldman received the MS degree in computer science from the University of Central Florida, in 2013. His research interests include concurrency, interthread helping techniques, and progress conditions. This has led to the development of several wait-free algorithms.
Carlos Valera-Leon received the BS degree in computer science from the University of Central Florida, in 2014. His interests include concurrent algorithms, synchronization techniques, and distributed systems.
Damian Dechev is an assistant professor in the EECS Department, University of Central Florida and the founder of the Computer Software Engineering-Scalable and Secure Systems Lab, UCF. He specializes in the design of scalable multiprocessor algorithms and has applied them to real-time embedded space systems at NASA JPL and HPC data-intensive applications at Sandia National Labs. His research has been supported by grants from the US National Science Foundation (NSF), Sandia National Laboratories, and the Department of Energy. 
