There has been a significant amount of work in the literature proposing semantic relaxation of concurrent data structures for improving scalability and performance. By relaxing the semantics of a data structure, a bigger design space, that allows weaker synchronization and more useful parallelism, is unveiled. Investigating new data structure designs, capable of trading semantics for achieving better performance in a monotonic way, is a major challenge in the area. We algorithmically address this challenge in this paper.
Introduction
Figure 1: An illustration of our 2D design using a Stack as an example. There are three sub-stacks a, b and c. k is proportional to the area of the green dashed rectangle in which operations are bounded to occur. a can be used for both Push and Pop. b can be used for Push but not for Pop. c can be used for Pop but not for Push.
Concurrent data structures allow operations to access the data structure concurrently, which requires synchronised access to guarantee consistency with respect to their sequential semantics [11, 10]. The synchronisation of concurrent accesses is generally achieved by guaranteeing some notion of atomicity, where an operation appears to occur at a single instant between its invocation and its response. A concurrent data structure is typically designed around one or more synchronisation access points, from which threads consistently compute the current state of the data structure. Synchronisation is vital to achieving consistency and cannot be eliminated [5]. Nevertheless, synchronisation can generate contention on memory resources, hurting scalability and performance.
Reducing contention at the synchronisation access points, and consequently improving scalability, has long been a major focus for concurrent data structure researchers. Techniques such as elimination [1, 19, 31], combining [32], dynamic elimination-combining [6] and back-off strategies have been proposed as ways to improve scalability. To address the scalability bottlenecks of concurrent data structures more fundamentally, it has been proposed that the legal semantic behaviour of data structures should be extended [29]. This line of research has led to the introduction of an extended set of weak semantics, including weak internal ordering, weakened consistency and semantic relaxation.
One of the main definitions of semantic relaxation proposed and used in the literature is k-out-of-order [2, 20, 33, 16, 25, 34]. k-out-of-order semantics allow operations to occur out of order within a given bound k; e.g., a Pop operation of a k-out-of-order stack can remove any item among the k topmost stack items. By allowing this, the extended stack semantics no longer impose a single access point. Thus, they allow for potentially more efficient stack designs with reduced synchronisation overhead.
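To make the k-out-of-order definition concrete, the following is a minimal sequential sketch of a stack whose Pop may return any of the k topmost items. The class name `KRelaxedStack` and the use of a random choice among the k candidates are illustrative assumptions; the sketch models only the relaxed semantics, not any of the cited concurrent designs.

```python
import random


class KRelaxedStack:
    """Sequential toy model of a k-out-of-order stack (illustrative only)."""

    def __init__(self, k, seed=None):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.items = []                 # the end of the list is the stack top
        self.rng = random.Random(seed)  # assumption: any choice rule would do

    def push(self, item):
        self.items.append(item)

    def pop(self):
        """Remove and return one of the k topmost items, or None if empty."""
        if not self.items:
            return None
        candidates = min(self.k, len(self.items))
        # Offset 0 is the strict LIFO top; larger offsets are relaxed choices.
        offset = self.rng.randrange(candidates)
        return self.items.pop(len(self.items) - 1 - offset)
```

With k = 1 the sketch degenerates to a strict LIFO stack, which is one way to see k as the knob that trades ordering for freedom in choosing an access point.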
Relaxation can be exploited to achieve improved parallelism by increasing the number of disjoint access points, or by increasing thread-local data processing. Disjoint access is popularly achieved by distributing operations over multiple instances of a given data structure [16, 25, 15]. Locality, on the other hand, is generally achieved by binding a single thread's accesses to the same memory location for specific operations [34, 14, 15].
In this paper, we introduce an efficient two-dimensional algorithmic design framework that uses multiple instances (sub-structures) of a given data structure, as shown in Figure 1. The first dimension of the framework is the number of sub-structures that operations are spread over, in order to benefit from parallelism through disjoint access points; the second dimension is the number of consecutive operations that can occur on the same sub-structure, in order to benefit from data locality. We use two parameters to control the dimensions: width for the first dimension (horizontal) and depth for the second dimension (vertical), as shown in Figure 1.
A thread can operate on a given sub-structure for as long as a set of conditions hold (validity). Valid sub-structures do not exceed (max) or go below (min) a given operation count threshold, as depicted by the dashed green rectangle in Figure 1. Validity conditions make sub-structures valid or invalid for a given operation. This implies that threads have to search for a valid sub-structure, increasing operation cost (latency). Our framework overcomes this challenge by limiting the number of sub-structures and allowing a thread to operate on the same sub-structure consecutively for as long as the validity conditions hold. Max and min can be updated if there are no valid sub-structures. We show algorithmically that the validity conditions provide an efficient, tenable and tunable relaxation behaviour, described by tight deterministic relaxation bounds.
Our design framework can be used to extend existing lock-free data structure algorithms to derive k-out-of-order semantics. This can be achieved with minimal modifications to the data structure algorithm, as we show later in this paper. Using our framework, we extend existing lock-free algorithms to derive lock-free k-out-of-order stacks, queues, deques and counters. Detailed implementations, proofs of correctness and performance analysis are also part of the contribution of this paper. Experimental evaluation shows that the derived data structures significantly outperform all respective previously known data structure implementations. This work extends our preliminary work, which considered only stack relaxation in two dimensions [27].
The rest of the paper is structured as follows. In Section 2 we discuss literature related to this work. We present the 2D framework in Section 3, followed by the derived algorithms in Section 4. Algorithmic optimizations are discussed in Section 5. We present complexity analysis and correctness proofs in Sections 6 and 7, respectively. To enrich our evaluation, a palette of additional multi-structure algorithms is presented in Section 8. An experimental evaluation is presented and discussed in Section 9. The paper concludes in Section 10.
Related Work
Recently, data structure semantic relaxation has attracted the attention of researchers as a promising direction towards improving the scalability of concurrent data structures [20, 33, 29]. It has also been shown that small changes to the semantics of a data structure can have a significant effect on the computational power of the data structure [30]. The interest in semantic relaxation is largely founded on its ease of use and understanding. One of the main definitions of semantic relaxation proposed and used is k-out-of-order.
Using the k-out-of-order definition, a segmentation technique was proposed in [2] and later revisited in [20], realizing a relaxed Stack (k-Stack) and FIFO Queue (Q-segment) with k-out-of-order semantics. The technique involves a linked list of memory segments, each with k indexes at which an item can be added or removed. The stack items are accessed through the topmost segment, whereas the queue has a tail and a head segment from which Enqueue and Dequeue can occur, respectively. Segments can be added and removed. Relaxation is controlled only by varying the number of indexes per segment. As discussed in Section 1, increasing the number of indexes increases operation latency and eventually becomes a performance bottleneck. This limits the performance benefits of the technique to a small range of relaxation values.
Load balancing together with multiple queue instances (sub-queues) has also been used to design a relaxed FIFO queue (lru) with k-out-of-order semantics [16]. Each sub-queue maintains two counters, one for Enqueue and another for Dequeue, while two global counters, one for Enqueue and another for Dequeue, maintain the total number of operations for all sub-queues. The global counters are used to calculate the expected number of operations on the least-recently-used sub-queue. Threads can only operate on the least-recently-used sub-queue. This implies that for every operation threads must synchronise on the global counter, making it a sequential bottleneck. Moreover, threads have to search for the least-recently-used sub-queue, leading to increased latency.
The relaxation techniques mentioned above apply relaxation in one dimension, i.e., they increase disjoint access points to improve parallelism and reduce contention. However, this also increases operation latency due to increased search cost. Without a remedy for this downside, the proposed techniques cannot provide monotonic relaxation for better performance. Other relaxed data structures studied in the literature include priority queues [4, 25, 34]. Apart from semantic relaxation, other design strategies for improving scalability have been proposed, including elimination [1, 31, 6], combining [32], internal weak ordering [12], and local linearizability [15]. However, these strategies have not been designed to provide bounded out-of-order semantic relaxation.
Elimination implements a collision path on which different concurrent operations try to collide and cancel out, otherwise, they proceed to access the central structure [19, 24] . Combining, on the other hand, allows operations from multiple threads to be combined and executed by a single thread without the other threads contending on the central structure [13, 18] . However, their performance depends on the specific workload characteristics. Elimination mostly benefits symmetric workloads, whereas combining mostly benefits asymmetric workloads. Furthermore, the central structure sequential bottleneck problem still persists.
Weak internal ordering has been proposed and used to implement a timestamped stack (TSStack) [12], where Push timestamps each pushed item to mark the item's precedence order. Each thread has its local buffer on which it performs Push operations. However, Pop operations pay the cost of searching for the latest item. In the worst case, Pop operations might contend on the same latest item if there are no concurrent Push operations. This leads to search retries, especially for workloads with higher Pop rates than Push rates.
Local linearizability has also been proposed for concurrent data structures such as FIFO queues and Stacks [15]. The technique relies on multiple instances of a given data structure. Each thread is assigned an instance on which it locally linearizes all its operations. Enqueue (FIFO queue) or Push (Stack) operations occur on the instance assigned to a given thread, whereas Dequeue or Pop can occur on any of the available instances. With Dequeue or Pop occurring more frequently, contention quickly builds as threads try to access remote buffers. The threads also lose the locality advantage while accessing remote buffers, cancelling out the caching advantage, especially for single-access data structures such as the Stack [17, 8, 28].
The 2D Framework
In this section, we describe our 2D design framework and show how it can be used to extend existing data structure designs to derive k-out-of-order relaxed semantics. Such data structures include stacks, FIFO queues, counters and deques.
The 2D framework uses multiple copies (sub-structures) of the given data structure as depicted in Figure 1 . Threads can operate on any of the sub-structures following the fixed maximum max and minimum min operation count threshold. Herein, operation refers to the process that updates the data structure state by adding (Put) or removing (Get) an item (Push and Pop respectively for the stack example). Each sub-structure holds a counter (sub-count) that counts the number of local successful operations.
A combination of max, min and the number of sub-structures forms a logical count period that we refer to as the Window, depicted by the dashed green rectangle in Figure 1. The Window limits the number of operations that can occur on each sub-structure ([$Win_{max}$, $Win_{min}$]): a maximum ($Win_{max}$) and minimum ($Win_{min}$) operation count threshold for all sub-structures, for a given period. This implies that, for a given period, a sub-structure can be valid or invalid, as exemplified in Figure 1, and a Window can be full or empty. The Window is full if all sub-structures have maximum operations (sub-count = $Win_{max}$), and empty if all sub-structures have minimum operations (sub-count = $Win_{min}$). The Window is defined by two parameters, width and depth: width = #sub-structures, and depth = $Win_{max} - Win_{min}$.
To validate a sub-structure, its sub-count is compared with [$Win_{max}$, $Win_{min}$]; either sub-count ≥ $Win_{min}$ or sub-count < $Win_{max}$. If the given sub-structure is invalid, the thread has to hop to another sub-structure until a valid sub-structure is found (validity is operation specific, as we discuss later). If a thread cannot find a valid sub-structure, then the Window is either full or empty. The thread will then either increment or decrement [$Win_{max}$, $Win_{min}$], a process we refer to as Window shifting.
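The validity test and the full/empty classification can be sketched as a few small predicates. The function names are hypothetical, and the sketch assumes the operation-specific rule used later in the complexity analysis: a Put is valid while the sub-count is strictly below the upper threshold, and a Get is valid while the sub-count is strictly above the lower threshold.

```python
def valid_for_put(sub_count, win_max):
    """A Put may use a sub-structure that has not reached the upper bound."""
    return sub_count < win_max


def valid_for_get(sub_count, win_min):
    """A Get may use a sub-structure that is above the lower bound (assumed strict)."""
    return sub_count > win_min


def window_state(sub_counts, win_min, win_max):
    """Classify the Window as 'full', 'empty' or 'partial'."""
    if all(count >= win_max for count in sub_counts):
        return "full"      # no sub-structure is valid for Put: shift up
    if all(count <= win_min for count in sub_counts):
        return "empty"     # no sub-structure is valid for Get: shift down
    return "partial"
```

For instance, with depth = 2 and thresholds 0 and 2, the sub-counts (1, 2, 0) leave the first sub-structure valid for both operations, the second valid only for Get, and the third valid only for Put.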
We define two types of Windows: WinCoupled (2Dc) and WinDecoupled (2Dd). WinCoupled implements one joint access Window for all operations, whereas WinDecoupled implements two disjoint access Windows. Data structures such as FIFO queues, with disjoint access for Put and Get, can benefit more from the WinDecoupled disjoint Window design, whereas data structures such as stacks, with joint access, can benefit more from WinCoupled.
WinCoupled
WinCoupled couples both Put and Get to share the same Window and sub-count for each sub-structure. A successful Put increments, whereas a successful Get decrements, the given sub-count. On a full Window, Put increments $Win_{max}$, shifting the Window up ($shift_{up}$), whereas on an empty Window, Get decrements $Win_{max}$, shifting the Window down ($shift_{down}$). WinCoupled resembles elimination [31], except that here we cancel out operation counts for matching Put and Get on the same sub-structure within the same Window. Just as elimination reduces joint access updates, WinCoupled reduces Window shift updates.
In Algorithm 1, we present the algorithmic steps for WinCoupled. Recall that width = #sub-structures and depth = $Win_{max} - Win_{min}$. Each sub-structure is uniquely identified by an index, which holds information including a pointer to the sub-structure, a sub-count counter, and a version number (lines 1-4). The version number is used to avoid ABA-related issues. Using a wide CAS, we update the index information in a single atomic step.
To perform an operation, the thread has to search for and select a valid sub-structure within a Window period. Starting from the search start index, the thread stores a copy of the Window locally (line 9), which is used to detect Window shifts while searching (line 18, 32). During the search, the thread validates each sub-structure count against $Win_{max}$ (line 28, 30). If a valid index is selected, the respective descriptor state and index are returned (line 29, 31). The thread can then proceed to try to operate on the given sub-structure pointed to by the index descriptor. As an emptiness check, the Window search can only return an empty sub-structure (line 15) if, during the search, all sub-structures were empty (NULL pointer). Using the Window parameters, width and depth, we can tightly bound the relaxation behaviour of the derived 2D data structures, as discussed in Section 7.
However, we should note that Window and sub-structure updates occur independently of each other. For WinCoupled, this can lead to a sub-structure being updated although the Window it was selected from has since shifted. Take as an example a Get selecting a sub-structure from a full Window ($w_1$) at time $t_1$, followed by a Put that reads the full Window and shifts it up to $w_2$ at $t_2$. It is possible for the Get to update the selected sub-structure at $t_3$ based on $w_1$, which has since shifted to $w_2$. This difference is, however, bounded, as proved in Section 7.
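The coupled Window logic can be illustrated with a single-threaded model that ignores CAS, version numbers and concurrent shifts altogether. The class `WinCoupledStack`, the round-robin search, and the `shift` parameter (defaulting to depth) are assumptions made for this sketch; it is not Algorithm 1, only a way to see how sub-counts stay inside the Window.

```python
class WinCoupledStack:
    """Single-threaded model of WinCoupled applied to sub-stacks (sketch)."""

    def __init__(self, width, depth, shift=None):
        self.width = width
        self.depth = depth
        # Assumption: the Window moves by `shift` items, 1 <= shift <= depth.
        self.shift = depth if shift is None else shift
        self.win_max = depth                    # so the lower bound starts at 0
        self.subs = [[] for _ in range(width)]  # sub-count == len(sub-stack)
        self.last = 0                           # last successful index (locality)
        self.hops = 0                           # sub-structures skipped so far

    @property
    def win_min(self):
        return self.win_max - self.depth

    def _search(self, is_valid):
        # Start at the last successful index, then hop round-robin.
        for step in range(self.width):
            i = (self.last + step) % self.width
            if is_valid(len(self.subs[i])):
                self.hops += step
                self.last = i
                return i
        return None

    def push(self, item):
        while True:
            i = self._search(lambda n: n < self.win_max)
            if i is not None:
                self.subs[i].append(item)
                return
            self.win_max += self.shift          # full Window: shift up

    def pop(self):
        while True:
            i = self._search(lambda n: n > self.win_min)
            if i is not None:
                return self.subs[i].pop()
            if self.win_min == 0:
                return None                     # every sub-stack is empty
            # Empty Window: shift down, never below the initial position.
            self.win_max = max(self.depth, self.win_max - self.shift)
```

In this sequential model every sub-stack size stays within the current Window, so sizes never differ by more than depth; the concurrent case additionally has to account for the stale-Window updates described above.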
WinDecoupled
WinDecoupled decouples Put and Get and assigns them independent Windows. Also, an independent sub-count is maintained for Put and for Get on each sub-structure. Unlike WinCoupled, both operations always increment their respective sub-count on a successful operation, and $Win_{max}$ on a full Window. This implies that both sub-count and Window counters are always increasing, as shown in Algorithm 2.
Recall that operations are decoupled to operate with independent Windows (line 12, 14). Here the sub-structure index differs from that under WinCoupled; it includes a pointer to the sub-structure, a Put-sub-count counter for Put (line 4) and a Get-sub-count counter for Get (line 3). Note that we do not need a version number, since the counter updates increase monotonically, which avoids the ABA problem. Index information is stored in a descriptor (Des) that can be updated in one atomic step using a wide CAS operation, just as in WinCoupled.
To perform an operation, the thread has to call the Window function (Window()) to obtain a valid sub-structure for the given operation. Window takes three arguments: type of operation, search start index and contention indicator (line 8). The type of operation determines which Window to operate with. Starting from the search start index, the thread stores a copy of the Window locally (line 19, 52), which is used to detect Window shifts while searching (line 34, 72). During the search, the thread validates each sub-structure against $Win_{max}$ (line 30, 66).
If no valid index is found, $Win_{max}$ is updated atomically, shifting the Window up. Both Put and Get shift the Window up by incrementing the $Win_{max}$ count by depth (line 24, 60). Before Window shifting or hopping to another index, the thread has to confirm that the Window has not shifted from the locally known state (line 34, 72). This guarantees that threads always start their search within the most current Window. A fresh search is started for every Window shift (IndexSearch = 0) (line 27, 63) and for every call to the Window function. If a valid index is found, the respective descriptor state and index are returned (line 31, 67). The thread can then proceed to try to operate on the given sub-structure pointed to by the descriptor.
As an emptiness check, the Window search can only return an empty sub-structure if, during the search, all sub-structures were empty. This is achieved by maintaining a notempty variable, which, when updated during a Get Window search, signals a non-empty state (line 77). If all sub-structures are found empty within the same Window search, a processor can return the empty state (line 57). An empty sub-structure is identified by a NULL pointer within its descriptor (line 68).
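The decoupled variant can be modelled in the same single-threaded style. In the sketch below, the class `WinDecoupledQueue` and its field names are hypothetical; it keeps separate, monotonically increasing Put and Get counters and Windows, shifts each Window up by depth when no sub-queue is valid, and mirrors the notempty-style emptiness check. Descriptors, wide CAS and contention handling from Algorithm 2 are deliberately left out.

```python
from collections import deque


class WinDecoupledQueue:
    """Single-threaded model of WinDecoupled applied to sub-queues (sketch)."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.subs = [deque() for _ in range(width)]
        self.put_count = [0] * width   # per sub-queue, only ever increases
        self.get_count = [0] * width   # per sub-queue, only ever increases
        self.put_max = depth           # upper bound of the Put Window
        self.get_max = depth           # upper bound of the Get Window
        self.put_last = 0
        self.get_last = 0

    def enqueue(self, item):
        while True:
            for step in range(self.width):
                i = (self.put_last + step) % self.width
                if self.put_count[i] < self.put_max:
                    self.subs[i].append(item)
                    self.put_count[i] += 1
                    self.put_last = i
                    return
            self.put_max += self.depth     # full Put Window: shift up

    def dequeue(self):
        while True:
            not_empty = False
            for step in range(self.width):
                i = (self.get_last + step) % self.width
                if not self.subs[i]:
                    continue               # empty sub-queue, keep searching
                not_empty = True
                if self.get_count[i] < self.get_max:
                    self.get_count[i] += 1
                    self.get_last = i
                    return self.subs[i].popleft()
            if not not_empty:
                return None                # all sub-queues empty in one search
            self.get_max += self.depth     # full Get Window: shift up
```

Because both counters only grow, no Window ever moves down in this model, which is the property that removes the need for a version number in the descriptor.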
Deriving 2D Data structures
Our framework can be used to extend existing algorithms to derive k-out-of-order data structures. Using WinCoupled we derive a 2Dc-Stack and a 2Dc-Counter, whereas by using WinDecoupled we derive a 2Dd-Stack, a 2Dd-Queue, a 2Dd-Deque and a 2Dd-Counter, as shown in Table 1. The base algorithms include, but are not limited to, Treiber's stack [7], MS-queue [22] and Deque [21] for Stack, FIFO Queue and Deque, respectively.
2D-Stack
A stack is characterized by two operations: Push, which adds an item, and Pop, which removes an item from the stack. Our derived 2D-Stack algorithms are composed of multiple lock-free sub-stacks. Each sub-stack is implemented using a linked list following Treiber's stack design [7], modified only to fit the Window design. In Algorithm 3 we present the algorithmic implementation of 2Dc-Stack as an example of how to use WinCoupled to derive k-out-of-order data structures. The stack head is modified to a descriptor containing the top item pointer, operation count, and descriptor version. Note that the descriptor can still be updated in one atomic step using a wide CAS (line 8, 21), the same way as in Treiber's stack.
To perform an operation, a given thread obtains a sub-stack by performing a Window search (line 3, 16). The thread then prepares a new descriptor based on the existing descriptor at the given index (lines 4-7, 18-20). Using a CAS, the thread tries to atomically swap the existing descriptor with the new one (line 8, 21). If the CAS fails, the thread sets the contention indicator to true (line 11, 24) and restarts the Window search. The contention indicator (cont) signals the presence of contention at the current sub-stack, which generally indicates possible contention globally.
A successful Push increments, whereas a successful Pop decrements, the operation count by one (line 6, 19). The topmost item pointer is also updated. At this point, a Push adds an item, whereas a Pop returns an item for a non-empty stack or NULL for an empty stack (line 27). An empty sub-stack is represented by a NULL item pointer within the descriptor (line 17). Recall that the framework performs a special emptiness check before returning an empty sub-stack.
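The descriptor update on a single sub-stack can be sketched as follows. Python has no hardware wide CAS, so the `cas` method below emulates one with a lock; this emulation, the names `SubStack`, `try_push` and `try_pop`, and the omission of the Window search are all assumptions of the sketch. A failed attempt returns `False`, which is the point at which the algorithm described above would set the contention indicator and restart the Window search.

```python
import threading
from collections import namedtuple

Node = namedtuple("Node", ["item", "next"])
Descriptor = namedtuple("Descriptor", ["top", "count", "version"])


class SubStack:
    """One sub-stack whose head is a (top, count, version) descriptor."""

    def __init__(self):
        self.des = Descriptor(top=None, count=0, version=0)
        self._lock = threading.Lock()   # stands in for a hardware wide CAS

    def cas(self, expected, new):
        """Swap in `new` only if the descriptor is still `expected`."""
        with self._lock:
            if self.des is expected:
                self.des = new
                return True
            return False


def try_push(sub, item):
    """One Push attempt; False means the CAS failed (contention)."""
    des = sub.des
    new_des = Descriptor(Node(item, des.top), des.count + 1, des.version + 1)
    return sub.cas(des, new_des)


def try_pop(sub):
    """One Pop attempt; returns (succeeded, item), item is None if empty."""
    des = sub.des
    if des.top is None:
        return True, None
    new_des = Descriptor(des.top.next, des.count - 1, des.version + 1)
    if sub.cas(des, new_des):
        return True, des.top.item
    return False, None
```

A caller would typically loop, choosing a (possibly different) sub-stack after each failed attempt; retrying on the same sub-stack, as a plain Treiber stack would, is the behaviour the Window search is meant to avoid under contention.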
WinDecoupled can be used to derive the 2Dd-Stack. This follows the same procedure as discussed above, with the difference that Push and Pop increment different operation counters on success. For Push, line 6 changes to NDes.putcount = Des.putcount+1 and line 7 changes to NDes.getcount = Des.getcount. For Pop, line 19 changes to NDes.getcount = Des.getcount+1 and line 20 changes to NDes.putcount = Des.putcount. Note that we do not need a version number, because the counters are always incrementing. This implies that each sub-stack update has a unique count value, eliminating the ABA problem on individual sub-stacks.
2D-Queue
Queues are characterized by two operations: Enqueue, which adds an item to the queue, and Dequeue, which removes an item. The two operations access the queue from different points: the head for Dequeue and the tail for Enqueue. We use WinDecoupled to derive a 2D-Queue, due to its ability to maintain independent operation counts. The 2D-Queue is composed of multiple lock-free sub-queues. Each sub-queue is implemented using a linked list following the Michael-Scott FIFO queue (MS-queue) design [22], modified to fit the Window processes, as shown in Algorithm 4. Each sub-queue has two unique indexes: Put and Get. The queue is modified by replacing the head and tail with independent descriptors, each containing an item pointer and an Enqueue count or Dequeue count. The descriptor is also updated in one atomic step using a wide CAS (line 15, 38), the same as in MS-queue updates.
To perform an operation, a given thread obtains a sub-queue by performing a Window search (line 3,29). An Enqueue completes in two steps: First, the thread tries to add the new item to the list (line 8), if successful, then tries to update the tail descriptor with the new state (line 20). The thread can also help update the descriptor if it encounters an incomplete Enqueue (line 15). Any CAS failure during an Enqueue (line 11, 16, 21) signals the presence of contention on the given sub-queue and the thread has to retry the Window search. The same applies to the Dequeue (line 46).
Both Enqueue and Dequeue increment their respective sub-queue index operation counts by one on success (line 6, 42). Note that Dequeue can increment the Enqueue operation count if it helps complete a pending Enqueue (line 37). The item pointers are also updated. At this point, an Enqueue adds an item, whereas a Dequeue returns an item for a non-empty sub-queue or NULL for an empty queue. As an emptiness check, the Dequeue only returns NULL if the Window search cannot find a non-empty sub-queue (line 27).
2D-Deque
The relaxed double-ended queue (deque) design is based on the lock-free design of Michael [21]. The design follows a doubly-linked list structure, where each node contains pointers to its right and left neighbors and includes a data field. The two ends of the doubly-linked list are pointed to by two head pointers, and there is a variable that keeps the status tag. These three variables are framed into a descriptor, which is replaced atomically by operations. The status tag indicates whether the deque is stable or not. When a process finds the deque in an unstable state, it must first attempt to take it to the stable state (help the incomplete operation) before attempting its own operation. The design does not provide disjoint access parallelism: all concurrent operations serialize, since all try to modify the same descriptor, even when the operations occur on opposite ends of a non-empty deque. In other words, there is at most one incomplete operation when the deque is in an unstable state. The abstract data type provides pushRight, popRight, pushLeft and popLeft operations. Pop operations do not require helping, as they complete in a single step with a single CAS. Push operations are composed of three steps: the first linearizes the operation and puts the structure into an unstable state, and the remaining two are subject to helping from other threads. The first helping step sets the relevant back pointer, and the second puts the deque into the stable state.
We convert this design into a sub-structure that can be used by our generic methodology. This can simply be done by adding four counters, one for each operation type, to the descriptor of the deque design. When an operation occurs at the sub-structure, the respective counter is incremented atomically, at the linearization point of the operation in the original design. In the design (both the original and our sub-structure), there is total symmetry between operations that occur at the right and left of the deque (i.e., the code is identical except that one modifies the right head and the other the left head). Therefore, we provide the pseudo-code (see Figure 5) only for the pushRight and popRight operations, which are slightly modified versions of the original design.
2D-Counter
2D-Counter is characterized by two operations: increment (Put), which increases the counter, and decrement (Get), which decreases the counter. It is composed of multiple sub-counters whose local count (sub-count) can only be greater than or equal to zero. Both 2Dc-Counter and 2Dd-Counter follow the 2Dc-Stack and 2Dd-Stack implementation details. Following the same strategy, a successful operation increments or decrements a given sub-counter, then calculates the global count value as (sub-count × width). It is the global count that is returned by the thread.
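A single-threaded sketch of the coupled counter shows how the returned global count is derived from one sub-count. The class name `RelaxedCounter`, the round-robin search and the choice to shift the Window by depth are assumptions of this sketch; atomic updates and contention handling are omitted.

```python
class RelaxedCounter:
    """Single-threaded model of a 2Dc-Counter (illustrative only)."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.counts = [0] * width   # sub-counts, never below zero
        self.win_max = depth        # lower Window bound is win_max - depth
        self.last = 0               # index of the last successful operation

    def _find(self, valid):
        for step in range(self.width):
            i = (self.last + step) % self.width
            if valid(self.counts[i]):
                self.last = i
                return i
        return None

    def increment(self):
        i = self._find(lambda c: c < self.win_max)
        if i is None:                          # full Window: shift up
            self.win_max += self.depth
            i = self._find(lambda c: c < self.win_max)
        self.counts[i] += 1
        return self.counts[i] * self.width     # relaxed global count

    def decrement(self):
        i = self._find(lambda c: c > self.win_max - self.depth)
        if i is None:
            if self.win_max == self.depth:     # every sub-counter is zero
                return 0
            self.win_max -= self.depth         # empty Window: shift down
            i = self._find(lambda c: c > self.win_max - self.depth)
        self.counts[i] -= 1
        return self.counts[i] * self.width
```

The value returned is an estimate: it equals the exact sum of the sub-counts only when all sub-counts are equal, and otherwise deviates by an amount that depends on how far the sub-counts are spread inside the Window.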
Optimizations
The design can be tuned to optimize for locality, contention and sub-structure search overhead (hops), using the width and depth parameters.
Locality
To exploit locality, the thread starts its search from the previously known index on which it succeeded. This gives the thread a chance to operate on the same sub-structure multiple times locally, provided that the sub-structure is valid. To further improve locality, a thread that fails on a selected sub-structure (Algorithm 3: line 11, 24; Algorithm 4: line 11, 16, 21, 46) randomly selects another sub-structure (Algorithm 1: line 11; Algorithm 4: line 9), leaving the successful thread to take over locally. The thread operates on the same sub-structure locally for as long as it does not fail on the sub-structure CAS and the sub-structure is valid. Working locally improves the caching behaviour, which in turn improves performance. Locality also reduces the number of hops (Theorem 2). The cost of hopping includes reading new memory locations, which comes with different memory latency and cache coherence costs. Reducing the number of hops improves throughput, especially in NUMA execution environments with high communication cost between NUMA nodes [17, 8, 28].
Contention
A failed operation on a valid sub-structure signals the possibility of contention. The thread that fails on a CAS (Algorithm 3: line 11,24), starts the Window search on a randomly selected index (Algorithm 1: line 11). This reduces possible contention that might arise if the failed threads were to retry on the same sub-structure. Furthermore, random selection avoids contention on individual sub-structures by uniformly distributing the failed threads to all available sub-structures.
For every Window search, if the search start index is invalid, the thread tries a given number of random jumps (Algorithm 1: line 34; Algorithm 2: line 36, 74), then switches to round robin (Algorithm 1: line 42; Algorithm 2: line 41, 82) until a valid sub-structure is found. In our case, we use two random jumps as the optimal number for a random search, based on the power of two random choices [23]. However, this is a configurable parameter that can take any value.
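The search policy described above (a bounded number of random jumps followed by round robin) can be sketched as a standalone function. The name `find_valid_index`, the `is_valid` callback and the returned hop count are hypothetical; Window-shift detection during the search is omitted.

```python
import random


def find_valid_index(start, width, is_valid, random_jumps=2, rng=random):
    """Return (index, hops) for a valid sub-structure, or (None, hops)."""
    hops = 0
    if is_valid(start):
        return start, hops              # locality: reuse the start index
    index = start
    for _ in range(random_jumps):       # a few random jumps first
        index = rng.randrange(width)
        hops += 1
        if is_valid(index):
            return index, hops
    for _ in range(width):              # then round robin over all indexes
        index = (index + 1) % width
        hops += 1
        if is_valid(index):
            return index, hops
    return None, hops                   # nothing valid: the Window must shift
```

The round-robin phase visits every index once, so a `None` result means no sub-structure was valid at the time it was inspected, which corresponds to the full or empty Window case.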
We further note that contention is inversely proportional to the width. As a simple model, we split the latency of an operation into contention ($op_{cont}$) and contention-free ($op_{free}$) operation costs, given by $op = op_{cont}/width + op_{free}$. This means that we can increase the width to further reduce contention when necessary.
Sub-structure Search
The number of hops increases with an increase in width. This counteracts the performance benefits from contention reduction through increasing width, necessitating a balance between contention and hops reduction. Based on our simple contention model above, the performance would increase as the contention factor vanishes with the increase of width, but with an asymptote at $1/op_{free}$. This implies that beyond some point, one cannot really gain throughput by increasing the width; instead, throughput would suffer due to the increased number of hops. At some point as width increases, gains from the vanishing contention factor ($op_{cont}/width \to 0$ as $width \to \infty$) are surpassed by the increasing cost of hops. This is something that we want to avoid in our effort of relaxing semantics for throughput gain. To avoid this, we switch to increasing depth instead of width at the point of width saturation. Increasing depth reduces the number of hops. This is supported by our step complexity analysis presented in Theorem 2.
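The simple contention model can be evaluated numerically to see the saturation effect. The functions below are a direct transcription of the latency formula; the cost values used in the usage note are hypothetical and chosen only to make the arithmetic easy to follow, and the model deliberately ignores the cost of hops.

```python
def op_latency(width, op_cont, op_free):
    """Latency of one operation under the simple contention model."""
    return op_cont / width + op_free


def throughput(width, op_cont, op_free):
    """Operations per time unit; approaches 1 / op_free as width grows."""
    return 1.0 / op_latency(width, op_cont, op_free)
```

With assumed costs op_cont = 8 and op_free = 2, the modelled throughput is 0.1 at width 1 and 0.25 at width 4, and it stays below the asymptote 1/op_free = 0.5 however large the width becomes. Each doubling of width therefore buys less, which is the saturation point at which the text switches from increasing width to increasing depth.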
WinCoupled can be optimized to minimize the number of search hops that arise from shifting the Window. Put operations incur more hops on a full Window, as do Get operations on an empty Window. This is due to the reduced number of valid sub-structure indexes. To recall, a Put shifts the Window up to an empty state (all sub-structures having the minimum operation count), invalidating all sub-structures for Get operations (Algorithm 1: line 30); likewise, a Get shifts the Window down to a full state (all sub-structures having the maximum operation count), invalidating all sub-structures for Put operations (Algorithm 1: line 28). To reduce this effect and minimize the number of hops on both full and empty Window states, we configure the Window to shift with respect to the operation rate. This implies that, if the Push rate exceeds the Pop rate, then $shift_{up} > shift_{down}$, where $shift_{up} + shift_{down} = depth$. With this configuration, the Window can never shift to a full or empty state, reducing the sub-structure search cost through a reduced number of hops.
WinDecoupled can be optimized to minimize the number of search hops that arise from full sub-structures. Here we reduce the sub-structure search cost by letting a given thread avoid revisiting sub-structures that are locally known to be in a full state. Recall that under WinDecoupled, sub-structures only become invalid when full. This implies that a thread can memorize full sub-structures and avoid revisiting them while searching within the same Window. This is achieved by keeping a local map of all sub-structure full states for a given Window and a given operation (Algorithm 2: line 32, 69). A thread searches its local map for a valid state before accessing the given sub-structure index for validation (Algorithm 1: line 28, 64). Local searches improve locality and avoid expensive remote memory reads where possible.
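The local map of full sub-structures can be sketched as a small per-thread cache that is discarded whenever the Window changes. The names `FullMap` and `search`, and the use of the upper Window bound as the cache key, are assumptions of this sketch; the count of remote reads is returned only to make the saving visible.

```python
class FullMap:
    """Per-thread record of sub-structures seen full in the current Window."""

    def __init__(self):
        self.window = None
        self.full = set()

    def sync(self, win_max):
        # A shifted Window makes every remembered 'full' state stale.
        if win_max != self.window:
            self.window = win_max
            self.full.clear()

    def mark_full(self, index):
        self.full.add(index)

    def known_full(self, index):
        return index in self.full


def search(counts, win_max, start, local):
    """Return (index, remote_reads); index is None if none is valid."""
    local.sync(win_max)
    width = len(counts)
    remote_reads = 0
    for step in range(width):
        i = (start + step) % width
        if local.known_full(i):
            continue                    # skip without touching shared memory
        remote_reads += 1               # models reading the shared sub-count
        if counts[i] < win_max:
            return i, remote_reads
        local.mark_full(i)
    return None, remote_reads
```

The cache is safe in the decoupled setting because sub-counts only increase: a sub-structure seen full stays full until the Window shifts, at which point `sync` clears the map.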
Complexity Analysis

WinCoupled
In this section, we analyze the relation between width and depth using WinCoupled . However, the same result applies for WinDecoupled (See Appendix 6.2). We provide the expected step complexity of a sequential process, where a single thread applies the sequence of operations. The type of an operation in the sequence is determined independently with a fixed probability, where p denotes the probability of a Put operation. With multiple sub-structures, it is possible to make multiple hops in search of a valid sub-structure.
Let Global ($Win_{max}$) regulate the size of the sub-structures. Recall that width = #sub-structures, and denote the size of sub-structure $i$ by $N_i$. Put and Get are allowed to occur at sub-structure $i$ if $N_i \in [Global - depth, Global - 1]$ and $N_i \in [Global - depth + 1, Global]$, respectively. This means that, at any time, the size of a sub-structure can only vary in the vicinity of Global, more precisely: $\forall i, (Global - depth) \le N_i \le Global$. To recall, this interval is valid for the sequential process. We refer to this interval as the active region.
We introduce the random variables $N^{active}_i = N_i - (Global - depth)$, with $N^{active}_i \in [0, depth]$, that provide the number of items in the active region of sub-structure $i$, and the random variable $N^{active} = \sum_{i=1}^{width} N^{active}_i$ that provides the total number of items in the Window. As mentioned before, the depth dimension tries to exploit locality; thus, a thread starts an operation with a query on the sub-structure where the last successful operation occurred. This means that the thread hops iff $N^{active}_i = 0$ for a Get or $N^{active}_i = depth$ for a Put operation. Therefore, the number of sub-structures whose active regions are full is given by $N^{active}/depth$ at a given time, because the thread does not leave a sub-structure until its active region gets either full or empty. If the thread hops a sub-structure, then a new sub-structure is selected uniformly at random from the remaining set of sub-structures. If none of the sub-structures fulfills the condition (which implies that $N^{active} = 0$ at a Get or $N^{active} = depth \times width$ at a Put), then the Window shifts based on a given shift parameter (i.e., for a Put operation $Global = Global + shift_{up}$ and for a Get operation $Global = Global - shift_{down}$, where $1 \le shift_{down}, shift_{up} \le depth$). One can observe that the value of $N^{active}$ before an operation defines the expected number of hops and the shift of the Window.
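The sequential process described above can be explored numerically. The following is a minimal Python sketch, not the paper's algorithm listing: it tracks only the active-region counts $N^{active}_i$, uses a single shift value for both directions, and assumes that after a Window shift the operation proceeds on the last-used sub-structure. The function name and its parameters are hypothetical.

```python
import random

def simulate_wincoupled(width, depth, shift, p=0.5, ops=100_000, seed=1):
    """Return (average hops, average Window shifts) per operation."""
    rng = random.Random(seed)
    active = [0] * width   # active-region count of each sub-structure, in [0, depth]
    cur = 0                # sub-structure of the last successful operation
    hops = shifts = 0
    for _ in range(ops):
        is_put = rng.random() < p

        def valid(i):
            # Put needs a non-full active region, Get a non-empty one.
            return active[i] < depth if is_put else active[i] > 0

        if not valid(cur):
            others = [i for i in range(width) if i != cur]
            rng.shuffle(others)          # uniform choice among the remaining ones
            for i in others:
                hops += 1
                if valid(i):
                    cur = i
                    break
            else:
                # No valid sub-structure: shift the Window up (Put) or down (Get).
                delta = -shift if is_put else shift
                active = [a + delta for a in active]
                shifts += 1
        active[cur] += 1 if is_put else -1
        assert all(0 <= a <= depth for a in active)
    return hops / ops, shifts / ops
```

Calling, for example, `simulate_wincoupled(width=8, depth=4, shift=4)` and then repeating with a larger depth gives a rough feel for how the two dimensions affect the number of extra steps under these assumptions.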
To compute the expected step complexity of an operation that occurs at a random time, we model the random variation process around Global with a Markov chain, where the sequence of Put and Get operations leads to the state transitions. As a remark, we consider the performance of the sub-structures mostly when they are non-empty, since Get (NULL) and Put would have no hops in this case. The Markov chain is strongly related to $N^{active}$. It is composed of $K + 1$ states $S_0, S_1, \dots, S_K$, where $K = depth \times width$. For all $i \in [\![0, K]\!]$, the operation is in state $S_i$ iff $N^{active} = i$. For all $(i, j) \in [\![0, K]\!]^2$, $P(S_i \to S_j)$ denotes the state transition probability, which is given by the following function, where $p$ denotes the probability of a Put:
The stationary distribution (denoted by the vector $\pi = (\pi_i)_{i \in [\![0, K]\!]}$) exists for the Markov chain, since the chain is irreducible and positive recurrent (note that the state space is finite). The left eigenvector of the transition matrix with eigenvalue 1 provides the unique stationary distribution.
Lemma 1. For the Markov chain that is initialized with $p = 1/2$ and shift, where $l = shift \times width - 1$, the stationary distribution is given by the vector $\pi^l = (\pi^l_0, \pi^l_1, \dots, \pi^l_K)$, assuming $K - l \ge l$ (for $l > K - l$, one can obtain the vector from the symmetry
Proof. We have stated that the stationary distribution exists since the chain is aperiodic and irreducible for all $p$ and shift. Let $(M_{i,j})_{(i,j) \in [\![0, K]\!]^2}$ denote the transition matrix for $p = 1/2$ and shift. The stationary distribution vector $\pi^l$ fulfills $\pi^l M = \pi^l$, which provides the following system of linear equations:
In case $l = K - l$, (iv) and (v) are replaced with $2\pi^l_{(l = K - l)} = \pi^l_{i-1} + \pi^l_{i+1} + \pi^l_1 + \pi^l_K$. Based on a symmetry argument, one can observe that, for all $l$, $\pi^l_i = \pi^l_{K-i}$. The system can be solved in linear time ($O(K)$) by assigning any positive value to $\pi^l_0$ (for an irreducible chain, $\pi^l_i > 0$). The stationary distribution is unique, thus for any $\pi^l_0$, $\pi^l$ spans the solution space. We know that $\sum_{i=0}^{K} \pi^l_i = 1$; starting from $\pi^l_0 = 1$, we obtain the remaining entries and normalize each item by the sum.
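As a numerical cross-check of the stationary distribution, the sketch below builds a transition matrix and iterates it to a fixed point. The transition rule is an assumption read off the prose description of the process, not the paper's stated function: a Put moves $S_i$ to $S_{i+1}$, a Get moves $S_i$ to $S_{i-1}$, a Put in $S_K$ shifts the Window and lands in $S_{K-l}$, and a Get in $S_0$ lands in $S_l$, with $l = shift \times width - 1$. The function names are hypothetical.

```python
def transition_matrix(depth, width, shift, p=0.5):
    """Assumed transition matrix over the states S_0..S_K (see lead-in)."""
    K = depth * width
    l = shift * width - 1
    M = [[0.0] * (K + 1) for _ in range(K + 1)]
    for i in range(K + 1):
        # Put: one more item, or a Window shift up from the full state S_K.
        M[i][i + 1 if i < K else K - l] += p
        # Get: one item fewer, or a Window shift down from the empty state S_0.
        M[i][i - 1 if i > 0 else l] += 1 - p
    return M

def stationary(M, iters=20_000):
    """Approximate the left eigenvector of M for eigenvalue 1."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        step = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        # Lazy update: averaging with the old vector keeps the same fixed
        # point and avoids oscillation if the chain happens to be periodic.
        pi = [(a + b) / 2 for a, b in zip(pi, step)]
    return pi
```

Under this assumed rule with shift = depth and $p = 1/2$, solving the balance equations by hand gives $1/(2K)$ for the two end states and $1/K$ for every interior state, which is what the iteration converges to for small cases.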
An operation starts with the search of an available sub-structure. This search contains at least a single query at the sub-structure where the last success occurred; the rest incur a hop step. In addition, the operation might include the shift of the Window, as an extra step, denoted by Glo.
We denote the number of extra steps by Extra = hop + Glo. By linearity of expectation, we obtain $E(Extra) = E(hop) + E(Glo)$. Relying on the law of total expectation, we obtain:
$$E(hop) = \sum_{i=0}^{K} \sum_{op \in \{pop, push\}} E(hop \mid S_i, op)\, P(S_i, op) \quad \text{and} \quad E(Glo) = \sum_{i=0}^{K} \sum_{op \in \{pop, push\}} E(Glo \mid S_i, op)\, P(S_i, op),$$
where $P(S_i, op)$ denotes the probability of an operation to occur in state $S_i$. We analyze the algorithm for the setting where shift = depth and $p = 1/2$. We do this because the bound that we manage to find in this case is tighter and gives a better idea of the influence of the 2D parameters on the expected performance. For this case the stationary distribution is given by Lemma 1.
Theorem 2. For a 2Dc-structure that is initialized with parameters depth, width, shift = depth and $p = 1/2$, $E(Extra) = O\left(\frac{\ln width}{depth}\right)$.
Proof. Firstly, we consider the expected number of extra steps for a Put operation. Given that there are $N^{active}$ items, a Put attempt would generate an extra step if it attempts to push to a sub-structure that has $N^{active}_i = depth$ items. Recall that the thread sticks to a sub-structure until it is not possible to conduct an operation on it. This implies that the extra steps can be taken only in the states $S_i$ such that $i \bmod depth = 0$, because the thread does not leave a sub-structure before $N^{active}_i = 0$ or $N^{active}_i = depth$. In addition, a Put (Get) can only experience an extra step if the previous operation was also a Put (Get).
Given that we are in $S_i$ such that $i \bmod depth = 0$, the first requirement is to have a Put as the previous operation. If this is true, then the Put operation hops to another sub-structure, which is selected from the remaining set of sub-structures uniformly at random. At this point, there are $f = \frac{i}{depth} - 1$ full sub-structures in the remaining set of sub-structures. If a full sub-structure is selected from this set, this leads to another hop, and again a sub-structure is selected uniformly at random from the remaining set of sub-structures.
Consider a full sub-structure (one of the $f$). This sub-structure would be hopped if it is queried before querying the sub-structures that are empty. There are $width - f - 1$ empty sub-structures, thus a hop in this sub-structure would occur with probability $1/(width - f)$. There are $f$ such sub-structures. By linearity of expectation, the expected number of hops is given by $f/(width - f) + 1 = width/(width - f)$. This leads to $E(hop \mid S_i, Put) = p \times width/(width - f)$ if $i \bmod depth = 0$, or $E(hop \mid S_i, Put) = 0$ otherwise. From Lemma 1, $\pi_i < 2/(K + 1)$, and we obtain:
The bounds for $E(hop \mid Put)$ also hold for $E(hop \mid Get)$. Given that there are $K - i$ (the system is in state $S_i$) empty sub-structures, there are $e = \frac{K - i}{depth} - 1$ sub-structures whose Window regions are empty, minus the sub-structure that the thread last succeeded on. Using the same arguments as above (replacing $f$ with $e$ and $p$ with $1 - p$), we obtain the same bound.
The Window only shifts at $S_K$ if a Put operation happens and at $S_0$ if a Get operation happens. Thus $E(Glo) = \pi_K\, p + \pi_0 (1 - p)$. Finally, using $E(Extra) = E(hop) + E(Glo)$ we obtain the theorem.
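The hop count $f/(width - f) + 1 = width/(width - f)$ used in the proof can be checked exactly for small cases. The sketch below computes the expected number of sub-structures queried when drawing uniformly without replacement from $f$ full and $width - f - 1$ non-full sub-structures until a non-full one is found, and compares it with the closed form. The function names are hypothetical.

```python
from fractions import Fraction

def expected_queries(full, free):
    """Expected draws (without replacement) until the first non-full sub-structure.

    `full` is the number of full sub-structures and `free` (>= 1) the number
    of non-full ones among the remaining candidates.
    """
    if full == 0:
        return Fraction(1)
    # One query now; with probability full/(full+free) it hits a full
    # sub-structure and the search continues with one fewer full candidate.
    return 1 + Fraction(full, full + free) * expected_queries(full - 1, free)

def closed_form(width, f):
    """The expression width/(width - f) from the proof of Theorem 2."""
    return Fraction(width, width - f)
```

For instance, with width = 4 and $f = 2$ both functions give 2: on average two sub-structures are queried before a non-full one is found.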
WinDecoupled
Now, we apply the same reasoning for the 2Dd-Queue to analyze WinDecoupled behaviour. There are two Windows; let $Global_{enq}$ and $Global_{deq}$ represent $Win_{max}$ for Enqueue and Dequeue respectively. The two counters, $Global_{enq}$ and $Global_{deq}$, increase monotonically. Enqueue and Dequeue have the same complexity since they apply the same Window strategy, with the only difference that one consumes and the other produces elements. Therefore, we analyze only Enqueue. Let $N^{active}_i$ denote the number of elements in the active region of sub-queue $i$ for Enqueue. We have a single logical option, shift = depth, since $Global_{enq}$ is monotonically increasing.
We again model the process with a Markov chain where the states are strongly related to $N^{active}$. It is composed of $K$ states $S_1, \dots, S_K$, where $K = depth \times width$. For all $(i, j) \in [\![0, K]\!]^2$, $P(S_i \to S_j)$ denotes the state transition probability, that is given by the following
, where $\pi_i = 1/K$.
Theorem 3. For a 2Dd-Queue that is initialized with parameters depth, width, shift = depth, $E(Extra) = O\left(\frac{\ln width}{depth}\right)$.
Proof. We consider the expected number of extra steps for an Enqueue, which would generate an extra step if it attempts on a sub-queue that has $N^{active}_i = depth$ items. Recall that the thread sticks to a sub-queue until it is not possible to conduct an operation on it; thus extra steps will be taken only in the states $S_i$ such that $i \bmod depth = 0$, and before the first hop there are $f = \frac{i}{depth} - 1$ full sub-queues in the remaining set of sub-queues. Plugging $\pi_i = 1/K$ into the reasoning that is provided in Theorem 2, we obtain the theorem.
Correctness
In this section, we prove the correctness of the derived data structures, including their relaxation bounds and lock freedom. All our derived 2D data structures are linearizable with respect to k-out-of-order semantics for the respective data structure. To be consistent with the previous sections, here we present the 2Dc-Stack correctness proofs and list the theorems that prove the correctness of the other 2D data structures. Relaxation can be applied method-wise, and it is applied only to Pop operations; that is, a Pop pops one of the topmost $k$ items. Firstly, we require some notation. The Window defines the number of operations allowed to proceed on any given sub-stack. The Window is shifted by the parameter shift, $1 \le shift < depth$, and width = #sub-stacks. = Global. The number of items of sub-stack $j$ is denoted by $N_j$, $1 \le j \le width$. To recall, the top pointer, the version number and $N_j$ are embedded into the descriptor of sub-stack $j$, and all can be modified atomically with a wide CAS instruction.
2Dc-Stack
Lemma 4. Given that $Global = shift \times i$, it is impossible to observe a state $S$ such that $N_j > W^{max}_{i+1}$ (or $N_j < W^{min}_{i-1}$).
Proof. We show that this is impossible by considering the interleaving of operations. Without loss of generality, assume thread 1 ($P_1$) has set $Global = shift \times i$ at time $t'_1$. To do this, $P_1$ should have observed either $Global = shift \times (i - 1)$ and then $N_j = W^{max}_{i-1}$, or $Global = shift \times (i + 1)$ and then $N_j = W^{min}_{i+1}$. Let this observation of Global happen at time $t_1 < t'_1$. Consider the last successful Push operation at sub-stack $j$ before the state $S$ is observed for the first time (we do not consider Pop operations as they can only decrease $N_j$ to a value that is less than $W^{max}_{i+1}$; this case is covered by the first item below). Assume thread 0 ($P_0$) sets $N_j$ to $N_j > W^{max}_{i+1}$ in this Push operation. $P_0$ should observe $N_j \ge W^{max}_{i+1}$ and $Global > W^{max}_{i+1}$. Let $j$ be selected at time $t_0$, and let the linearization of the operation happen at $t'_0 > t_0$.
• If $t'_0 < t_1$, the concerned state $S$ cannot be observed, since Global cannot be changed (to $shift \times i$) after $N_j > W^{max}_{i+1}$ is observed.
• Else if $t'_1 < t_0$, the concerned state $S$ cannot be observed, since the Push operation cannot proceed after observing Global with such $N_j$.
• Else if $t_1 > t_0$, then $P_0$ cannot linearize, because this implies that $N_j$ has been modified since $P_0$ read the descriptor (the difference between the value of Global that is observed by $P_0$ and then by $P_1$ implies this); at least the version number would have changed since then.
• Else if $t_1 < t_0$, then this implies that Global has been modified since it was read by $P_1$; thus updating Global would fail, at least based on the version number.
Lemma 5. At all times, there exists an $i$ such that $\forall j$, $1 \le j \le width$: $W^{min}_i \le N_j \le W^{max}_{i+1}$.
Proof. Informally, the lemma states that the size (number of operations) of a sub-stack spans at most two consecutive accessible Windows. Assume that the statement is not true; then there should exist a pair of sub-stacks ($y$ and $z$) at some point in time $t$ such that $\exists i$, $N_y < W^{min}_i$ and $N_z > W^{max}_{i+1}$. Consider the last Push at sub-stack $z$ and the last Pop at sub-stack $y$ that linearize before or at time $t$.
Assume thread $P_0$ (Push) sets $N_z$ and thread $P_1$ (Pop) sets $N_y$. To do this, $P_0$ should observe $N_z \ge W^{max}_{i+1}$ and $Global > W^{max}_{i+1}$; let sub-stack $z$ be selected at $t_0$, and let the linearization of the Push operation occur at $t'_0 > t_0$. Similarly, for the Pop operation of $P_1$, let sub-stack $y$ be selected at $t_1$; $P_1$ should have observed $Global \le W^{min}_i$. Let the Pop operation linearize at time $t'_1 > t_1$. Now, we consider the possible interleavings.
• If $t'_0 < t_1$ (or the symmetric $t'_1 < t_0$, for which we do not repeat the arguments), then for $P_1$ to proceed and pop an item from sub-stack $y$, it is required that $Global \le W^{min}_i$. Based on Lemma 4, this is impossible when $N_z > W^{max}_i$.
• Else if $t_1 > t_0$, then $P_0$ cannot linearize, because this implies that $N_z$ has been modified since $P_0$ read the descriptor (the difference between the value of Global that is observed by $P_0$ and then by $P_1$ implies this). At least the version number would have changed since then.
• Else if $t_0 > t_1$, the argument above holds for $P_1$ too, so $P_1$ should fail to linearize.
Such a pair $N_z$ and $N_y$ cannot co-exist at any time.
Theorem 6. 2Dc-Stack is linearizable with respect to k-out-of-order stack semantics, where $k = (2shift + depth)(width - 1)$.
Proof. Consider the Push ($t^{push}_e$) and Pop ($t^{pop}_e$) linearization points, that insert and remove an item $e$ for a given sub-stack $j$ respectively, where, t − shift) = depth + 2shift. We know that this number is zero for sub-stack $j$ (the sub-stack into which $e$ is inserted) and we have $width - 1$ other sub-stacks. So, there can be at most $(2shift + depth)(width - 1)$ items that are pushed after $t^{push}_e$ and are not popped before $t^{pop}_e$.
2Dd-Stack
Theorem 7. 2Dd-Stack is linearizable with respect to k-out-of-order stack semantics, where k = (3depth)(width − 1).
Proof. Consider the linearization points of Push and Pop operations that respectively insert and remove the item $e$ into and from a sub-stack (sub-stack $i$). Let $t^{push}_e$ and $t^{pop}_e$ denote these points, respectively. Now, we bound the maximum number of items that are pushed after $t^{push}_e$ and are not popped before $t^{pop}_e$, to obtain $k$. We denote the number of items that are pushed to (popped from) sub-stack $j$ in the time interval $[t^{push}_e, t^{pop}_e]$ by $push_j$ ($pop_j$). Regarding the interval $[t^{push}_e, t^{pop}_e]$, we have: (i) $push_i = pop_i$, since the number of items that are pushed into and popped from sub-structure $i$ should be equal; (ii) $\forall j \in [1, width]$, $push_j \le push_i + depth + (depth - (push_i \bmod depth))$; and (iii) $\forall j \in [1, width]$, $pop_j \ge pop_i - depth - (pop_i \bmod depth)$.
Therefore, for any sub-stack $j$, the number of items that are pushed after $t^{push}_e$ and are not popped before $t^{pop}_e$ ($push_j - pop_j$) is at most: $push_i + depth + (depth - (push_i \bmod depth)) - (pop_i - depth - (pop_i \bmod depth)) = 3depth$. Summing over all $j \ne i$, we obtain the theorem.
2Dd-Queue
Theorem 8. 2Dd-Queue is linearizable with respect to k-out-of-order queue semantics, where $k = (depth)(width - 1)$.
Proof. The linearization points for 2Dd-Queue operations follow the MS-queue design. As shown in Algorithm 4, Dequeue linearizes either by returning NULL (line 27) or with a successful CAS (line 44). Enqueue linearizes with a successful CAS (line 8). For readability, we use $Global_{enq}$ and $Global_{deq}$ to represent $Win_{max}$ for Enqueue and Dequeue respectively. For Enqueue (Dequeue), the 2Dd-Queue algorithm searches for a sub-queue whose enqueue (dequeue) counter is smaller than $Global_{enq}$ ($Global_{deq}$). If no such sub-queue exists, $Global_{enq}$ ($Global_{deq}$) is increased by depth.
One can observe that $Global_{enq}$ and $Global_{deq}$ are monotonically increasing. Also, $Global_{enq} = i \times depth$ iff the Enqueue counters for all sub-queues are in the range
The same holds for $Global_{deq}$. $Global^{i}_{enq}$ is updated to $Global^{i+1}_{enq} = Global^{i}_{enq} + depth$ as a result of an Enqueue iff the Enqueue counters for all sub-queues are equal to $Global^{i}_{enq}$ before the Enqueue. These invariants hold at all times, since the concurrent operations that might violate them would fail at the CAS instructions that modify any sub-queue or any Global.
Based on these invariants, all the items that are enqueued while $Global_{enq} = i \times depth$ will be dequeued while $Global_{deq} = i \times depth$. Therefore, the maximum number of items that are enqueued before an item and are not dequeued before that item can be at most $width \times depth$. Disregarding the items that are enqueued on the same sub-queue, we obtain the theorem.
2Dd-Deque
Theorem 9. 2Dd-Deque is linearizable with respect to k-out-of-order deque semantics, where $k = (8depth)(width - 1)$.
Proof. Consider the linearization points of push and pop operations that respectively insert and remove the item $e$ into and from a sub-structure $i$. Let $t^{push}_e$ and $t^{pop}_e$ denote these points, respectively. First, we consider the case where the push and pop operations on item $e$ occur at the same side of the deque. Without loss of generality, we assume they happen at the right side of the deque. In this case, we bound the maximum number of items, that are pushed after t
Second, we consider the case where the push and pop operations on item $e$ occur at opposite sides of the deque. Without loss of generality, we assume the push operation happens at the right side of the deque. In this case, we bound the maximum number of elements: (i) that are pushed after $t^{push}_e$ from the left and are not popped before $t^{pop}_e$; (ii) that are already inside the deque at $t^{push}_e$ and are not popped before $t^{pop}_e$. Summing these two terms provides an upper bound (though not necessarily a tight one) and we obtain $k$. Let $size_j$ denote the size of sub-structure $j$ (number of elements inside deque $j$) at time $t^{push}_e$. $pushLeft_j$ and $popLeft_j$ denote the number of elements that are pushed to (and popped from) sub-structure $j$ from the left in the time interval $[t^{push}_e, t^{pop}_e]$. We know that for sub-structure $i$, we have $popLeft_i = pushLeft_i + size_i$. And, we have the following three relations: $\forall j \in [1, width]$, $popLeft_i \le popLeft_j + 2depth$, $pushLeft_i \ge pushLeft_j - 2depth$, $size_i \ge size_j - 4depth$. Thus, we obtain $\forall j \in [1, width]$, $8depth \ge pushLeft_j + size_j - popLeft_j$. Summing over all sub-structures $j \ne i$, $8depth(width - 1) \ge \sum_{j \ne i} (pushLeft_j + size_j - popLeft_j)$, we obtain an upper bound for the maximum number of elements that could be covered by (i) and (ii).
Finally, we obtain k for a pop operation by taking the maximum of case one and two: max(8depth(width − 1), 4depth(width − 1)) = 8depth(width − 1).
2Dc-Counter
Theorem 10. 2Dc-Counter is linearizable with respect to k-out-of-order counter semantics, where k = (2depth)(width − 1).
Proof. Lemma 5 states that the size of a sub-structure can span at most two consecutive accessible Windows, which implies that the difference between any two sub-counters can be at most $2depth$ at any point in time. Let $counter_i$ denote the counter value of sub-counter $i$. One can observe an error of at most $(2depth)(width - 1)$, because $\sum_{j=1}^{width} counter_j - width \cdot counter_i \le (2depth)(width - 1)$.
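For quick reference, the relaxation bounds stated in Theorems 6 to 10 can be collected in one helper. This is only a convenience sketch; the function name and the string keys are hypothetical, while the formulas are the ones stated in the theorems.

```python
def relaxation_bound(structure, depth, width, shift=None):
    """Return the k-out-of-order bound stated for the given 2D structure."""
    others = width - 1  # every sub-structure except the one holding the item
    if structure == "2Dc-Stack":
        if shift is None:
            raise ValueError("2Dc-Stack needs the shift parameter")
        return (2 * shift + depth) * others          # Theorem 6
    per_substructure = {
        "2Dd-Stack": 3 * depth,                      # Theorem 7
        "2Dd-Queue": depth,                          # Theorem 8
        "2Dd-Deque": 8 * depth,                      # Theorem 9
        "2Dc-Counter": 2 * depth,                    # Theorem 10
    }
    return per_substructure[structure] * others
```

For example, with depth = 4 and width = 5, the 2Dd-Queue bound is 16 and the 2Dd-Deque bound is 128.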
2Dd-Counter
Lock-freedom
For WinCoupled, Window shifting is lock-free iff shift < depth and obstruction-free iff shift = depth. For example, a Get operation might read an empty Window and shift it down to a full state, but before it selects a sub-structure, a subsequent Put reads the full Window state and shifts it up to an empty Window state. It is possible for this process to continue forever, leading to a system livelock. This is however avoided by setting the shift parameters to less than depth. Unlike WinCoupled, WinDecoupled is always lock-free.
Each sub-structure is lock-free: an operation can fail on CAS only if there is another successful operation. A Window shift can only fail if there is another successful shift operation preceded by a successful Put or Get, ensuring system progress. Thus, all our derived algorithms are lock-free, with the exception of possible obstruction freedom as discussed above.
Other Algorithms for Comparison
To facilitate a detailed study, we implement three extra relaxation techniques following the same multi sub-structure design; Random, Random-C2 and Round-Robin. These present a combination of characteristics that add value to our evaluation, as shown in Table 2 . Just like our design, we use the width parameter to define the number of sub-structures for all the derived algorithms.
For Random, a Put or Get operation selects a sub-structure randomly and proceeds to operate on it, whereas for Random-C2 , a Get operation randomly selects two sub-structures, compares their items returning the most correct depending on the data structure semantics [25, 3] . Put operations time stamp items marking their time of entry. It is these timestamps that are compared to determine the precedence order among the two items during a Get operation. Due to the randomized distribution of operations, we expect low or no contention, no locality, and no hops. These three characteristics help us compare and contrast with our optimisations discussed in Section 5. We derive S-random and S-random-c2 stacks, Q-random and Q-random-c2 queues, C-random and C-random-c2 counters for both Random and Random-C2 respectively. Random and Random-C2 derived algorithms do not provide deterministic k-out-of-order relaxation bounds.
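The choice-of-two selection can be illustrated with a small sequential sketch for the stack case. It is an illustration only, not the concurrent S-random-c2 implementation: it assumes width ≥ 2, uses a logical counter as the timestamp, and the class and method names are hypothetical.

```python
import itertools
import random

class RandomC2Stack:
    """Sequential sketch of a Random-C2 relaxed stack."""

    def __init__(self, width, seed=None):
        self.subs = [[] for _ in range(width)]   # one list per sub-stack
        self.clock = itertools.count()            # logical timestamps
        self.rng = random.Random(seed)

    def put(self, item):
        # Timestamp the item and push it onto a randomly chosen sub-stack.
        self.rng.choice(self.subs).append((next(self.clock), item))

    def get(self):
        # Pick two distinct sub-stacks and pop from the one whose top is newer.
        a, b = self.rng.sample(self.subs, 2)
        candidates = [s for s in (a, b) if s]
        if not candidates:
            return None                           # both sampled sub-stacks empty
        newest = max(candidates, key=lambda s: s[-1][0])
        return newest.pop()[1]
```

With width = 2 both sub-stacks are always compared, so the sketch behaves like an exact stack; with larger width the returned item may be arbitrarily far from the top, which matches the remark that no deterministic bound is provided.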
Under Round-Robin, a thread selects and operates once on a sub-structure in a strict round-robin order following its local counter. The thread must succeed on the selected sub-structure before proceeding to the next. Due to retries by contending threads on the same sub-structure, we expect contention and no hops. The thread selects a new sub-structure for each successful operation, hence we expect low or no locality. Round-robin scheduling of memory access operations can also take advantage of hardware prefetching, a good characteristic to compare and contrast with our locality optimisation. Using the Round-Robin technique, we derive an S-robin stack, a Q-robin queue, and a C-robin counter. Round-Robin provides relaxation bounds; we demonstrate this using S-robin, whose bound is given by Theorem 12.
S-robin Correctness
Theorem 12. S-robin is linearizable with respect to k-out-of-order stack semantics, where k = (2 × #threads − 1)(#sub-stacks − 1).
Proof. Consider the linearization points of Push and Pop operations that respectively insert and remove the item e into and from a sub-stack (let sub-stack 0). Let t push e and t pop e denote these points, respectively. Now, we bound the maximum number of items, that are pushed after t push e and are not popped before t pop e , to obtain k. We denote the number items that are pushed to (popped from) sub-stack i by thread j in the time interval Observe that each thread applies its operations in round robin fashion without skipping any index. If the previous successful Pop had occurred at sub-stack i, the next Pop occurs at sub-stack i + 1(mod #sub-stacks). The same applies for the push operations.
Without loss of generality, assume that thread 0 has inserted item e to sub-stack 0. This implies that ∀i, #sub-stacks − 1 ≥ i > 0, push 
Experimental Evaluation
We experimentally evaluate the performance of our derived 2D algorithms in comparison to k-out-of-order relaxed algorithms available in the literature and other state-of-the-art data structure algorithms. The k-out-of-order relaxed algorithms include: Least recently used queue (lru) [16], Segmented queue (Q-segment) and k-Stack [2, 20]. The other algorithms include: MS-queue (MS-queue) [22], Wait free queue (wfqueue) [35], Time stamped stack (TS-Stack) [12] and Elimination back-off stack (Elimination) [19]. Henceforth, width will be generally used to refer to the number of sub-structures (number of access points) for all algorithms using multiple sub-structures.
To facilitate a uniform comparison, we implemented all the evaluated algorithms using the same development tools and environment. The source code can be provided on request and will be made publicly available on publication of this work.
System Description
Experiments are run on two x86-64 machines: (i) an Intel Xeon E5-2687W v2 machine with 2 sockets, 8-core Intel Xeon processors each running at 3.4GHz, L2 cache = 256KB, L3 cache = 25.6MB (Multi-S) and (ii) an Intel Xeon Phi 7290 with one 72-core processor running at 1.5GHz, L2 cache = 1024KB (Single-S). Multi-S and Single-S run Ubuntu 16.04.2 LTS and CentOS Linux 7 Core operating systems respectively. The Multi-S machine is used to evaluate inter-socket execution behaviour, whereas Single-S is used to evaluate intra-socket behaviour. Threads are pinned one per core on both machines, excluding hyper-threading. Inter-socket execution is evaluated by pinning the threads one per socket in round-robin fashion. Threads randomly select between Put and Get with a given probability (operation rate). Memory is managed using the ASCYLIB framework SSMEM [9].
Our main goal is to achieve scalability under high operation rate (contention). To evaluate this, we simulate a high operation rate by excluding work between operations. To reduce the effect of NULL returns for Get operations, all algorithms are initialized with $2^{17}$ items. Each experiment is then run for five seconds, and we report the average of five repeats. Throughput is measured in terms of operations per second, whereas the relaxation behaviour (accuracy) is measured in terms of the error distance from the exact data structure sequential semantics [20]. The higher the error distance, the lower the accuracy.
Measuring accuracy
We adopt a similar methodology used in the literature [4, 25] . A sequential linked-list is run alongside the data structure being measured. For each operation Put or Get, a simultaneous insert or delete is performed on the linked-list respectively, following the exact semantics of the given data structure. A global lock is carefully placed at the data structure linearization points, locking both the linked-list and the data structure simultaneously. The lock allows only one thread to update both the data structure and the linked-list in isolation.
A given thread has to acquire the lock before it tries to linearize on any given sub-structure. Note that the Window search is independent of the lock. Items on the data structure are duplicated on the linked-list and can be identified by their unique labels. Insert operations happen at the head or tail of the list for LIFO or FIFO measurements respectively. A delete operation searches for the given item, deletes it, and returns its distance from the head (error distance). For counter measurements, we replace the linked-list with a fetch and add (FAA) counter. Both counters are updated in isolation using a lock as explained above. The error distance is calculated from the difference between the two counter values.
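The error-distance bookkeeping can be sketched as follows. This is a sequential illustration of the shadow list only; the global lock and the measured data structure are omitted, a Python list stands in for the linked-list, and the class and method names are hypothetical.

```python
class ShadowList:
    """Sequential shadow list that reports the error distance of each delete."""

    def __init__(self, lifo=True):
        self.items = []     # index 0 is the head of the list
        self.lifo = lifo

    def insert(self, label):
        # LIFO measurements insert at the head, FIFO measurements at the tail.
        if self.lifo:
            self.items.insert(0, label)
        else:
            self.items.append(label)

    def delete(self, label):
        # The distance from the head is the error distance of this removal;
        # an exact stack or queue would always remove the head (distance 0).
        distance = self.items.index(label)
        del self.items[distance]
        return distance
```

In a measurement run, each Put on the relaxed structure would be paired with `insert(label)` and each Get with `delete(label)` for the label it returned, and the returned distances would be aggregated.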
Experiment results are then plotted using logarithmic scales, throughput (solid lines) and error distance (dotted lines) sharing the x-axis. 
Dimension Tunability
Our design framework is tunable, giving designers the ability to manage performance optimizations for different execution environments and workloads, within a given tight relaxation bound (k). To evaluate this, we experiment with different parameter configurations, as shown in Figures 2 and 3 for 2Dc-Stack and 2Dd-Queue respectively. Curve (D1) depicts the case for fixed depth = 1 which also represents a case of relaxing in one dimension (horizontally). The other curves (1P,2P,3P,4P,5P and 6P) depict execution in two dimensions. For simplicity, width is described as a multiple of the number of threads (P ). It should however be noted that, width can be configured to be independent of the number of threads.
D1 presents the lowest throughput as k increases. This is attributed to the width increasing proportionally to k, leading to increased hops and a lack of locality exploitation. On the other hand, we observe improved throughput for two-dimensional executions. For all measured k, we observe that there is no consistent optimal width configuration. This implies that an optimal configuration is dependent on other factors, including k relaxation, type of workload, and the accuracy vs throughput trade-off. There are also notable differences between Single-S and Multi-S results. This calls for a multi-objective optimization model, which is beyond the scope of this paper.
With respect to the evaluated cases, we observe that width = 3×#threads provides a fair balance between accuracy and throughput performance especially for 2Dc-Stack in Figure 2 . However for smaller k, there are varying high throughput points. Since we do not have an optimization model, we empirically obtain the high throughput width configurations for different executions as shown in Table 3 . 3 × P dominates as the high throughput configuration, especially for 2Dc-Stack .
However, for Multi-S we note that a smaller width (1P) can achieve higher throughput. This is attributed to the high inter-socket communication cost: a smaller width allows for the exploitation of locality through relaxing more in the depth dimension. Exploiting locality reduces the communication between sockets, in turn avoiding the inter-socket communication cost.
Monotonicity With High Degree of Relaxation
In order to evaluate monotonicity with increasing relaxation bound (k), we fix the number of threads to 16. This is to match the number of cores available on Multi-S without hyper-threading. Results are presented in Figures 4, 5 and 6 for stack, queue and counter respectively. First, we observe the difference between WinCoupled and WinDecoupled for 2Dc-Stack and 2Dd-Stack respectively in Figure 4. 2Dc-Stack consistently outperforms 2Dd-Stack due to the reduced Window shifting updates. With 2Dc-Stack, a given thread can locally operate on the same sub-stack longer, since operation counts cancel each other out, leaving the sub-stack in a valid state. The longer a given sub-stack stays valid, the higher the chances of exploiting locality. This advantage is more evident with symmetric workloads (50% push rate). As the workload becomes more asymmetric (70% push rate), fewer Push counts are cancelled out by Pop counts. This implies that the Window gets full more frequently, leading to increased Window shifts. With 100% asymmetric workloads, 2Dc-Stack and 2Dd-Stack present similar execution behaviour. The same is observed for 2Dc-Counter and 2Dd-Counter in Figure 6.
All multi sub-structure based algorithms increase their width (number of sub-structures) as k increases to reduce contention and allow for increased disjoint access, as shown in Table 2. However, for k-Stack, Q-segment, and lru, hops increase as width increases, which explains their observed low throughput gain. S-robin, Q-robin and C-robin are not affected by hops. However, for smaller k values, they suffer from high contention arising from contending threads retrying on the same sub-structure until they succeed. As contention vanishes with high k values, throughput gain saturates due to lack of locality. Round-Robin algorithms take advantage of the hardware prefetching available on both Single-S and Multi-S machines to reduce the negative effect of the lack of locality. This explains the observed throughput gain as width increases. 2D algorithms maintain throughput gain by limiting width to a size beneficial to reducing contention and then switching to adjusting the depth to reduce hops. For our evaluation, the width of the 2D algorithms is configured as shown in Table 3. Once the algorithm attains the configured width, it switches to increasing depth as k increases. The depth parameter allows 2D algorithms to maintain throughput gain (monotonicity) through exploiting locality while reducing latency. This is observed for both Single-S and Multi-S machines.
In terms of accuracy, we observe an almost linear decrease in accuracy as k increases for all algorithms. However, for 2Dc-Stack we observe a reduced rate of accuracy loss when the algorithm switches to increasing depth.
Scaling With Threads
To evaluate the scalability of our design as the number of threads increases, we fix the relaxation bound to k = 10^4 and vary the number of threads, as shown in Figures 7, 8 and 9 for stack, queue and counter respectively. We choose k = 10^4 to reduce the effect of contention caused by the small width at lower k values. This helps us focus on scalability effects. 2D algorithms' width is configured as shown in Table 3. The width of the Random and Random-C2 algorithms is set to 3 × #threads, as the optimal balance between throughput and accuracy, since neither of them provides k relaxation bounds [26].
k-Stack and Q-segment maintain a constant segment size as the number of threads increases. This increases the rate at which segments fill up, leading to a high frequency of hops and a high segment maintenance cost, especially for asymmetric workloads. As observed, throughput gain saturates quickly, even for a low number of threads, leading to limited scalability.
The scalability of lru is mostly limited by the global counter used to calculate the least recently used sub-queue. For every operation, the thread has to increment the global counter using an FAA instruction, turning the counter into a scalability bottleneck. This can be observed when lru performance is compared to that of a single FAA counter (C-FAA). wfqueue suffers from the same sequential bottleneck of the FAA counter.
TS-Stack's throughput is limited by the Pop search retries, which search for the newest item. Moreover, Pop operations might contend on the same newest items if there are not enough concurrent Push operations. Pop also lacks locality, which explains the drop in throughput on the Multi-S machine due to the high inter-socket communication costs. We observe that throughput increases with increased Push rate. This is due to increased local processing and a larger number of generated young items, leading to reduced thread contention for Pop operations.
For Round-Robin algorithms, the width is inversely proportional to the number of threads (see Theorem 12). As the number of threads increases, the width decreases, leading to increased contention. This explains the observed drop in throughput for a high number of threads, especially for the S-robin and C-robin algorithms, since each of their sub-structures has a single access point. The effect of the lack of locality can be reduced by hardware prefetching, a feature available on both machines. This can also explain why Round-Robin algorithms perform better than the other algorithms that lack locality.
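As a rough illustration of how the widths used in this experiment move with the number of threads, the sketch below contrasts the two rules mentioned in this section. The Random rule (3 × #threads) is taken from the text. The Round-Robin rule is an assumption: the text only states that width is inversely proportional to the number of threads, so the constant `c` and the form `c * k // threads` are hypothetical and do not reproduce Theorem 12.

```python
def random_width(threads):
    """Width used for Random and Random-C2: three sub-structures per thread."""
    return 3 * threads


def round_robin_width(k, threads, c=1):
    """Hypothetical Round-Robin width, inversely proportional to threads.

    c is an illustrative proportionality constant, not a value from the paper.
    """
    return max(1, (c * k) // threads)


k = 10**4
for threads in (2, 16, 64):
    print(threads, random_width(threads), round_robin_width(k, threads))
```

Under these assumptions the Random width grows with the thread count while the Round-Robin width shrinks, which is consistent with the increased contention reported for Round-Robin algorithms at high thread counts.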
Random and Random-C2 algorithms are affected by the lack of locality, as is evident from the difference between the Single-S and Multi-S results. We observe that the performance difference between Random and 2D algorithms is larger on the Multi-S machine than on the Single-S machine. This demonstrates how much 2D algorithms gain from exploiting locality when executing on a Multi-S machine. Locality helps to avoid paying the high inter-socket communication cost through improved caching behaviour [17, 8, 28].
We observe a change in accuracy as the number of threads increases for all algorithms. This suggests that width has an effect on accuracy.
Conclusion
In this work, we have shown that semantic relaxation has the potential to monotonically trade relaxed semantics of concurrent data structures for throughput performance within tight relaxation bounds. This has been achieved through an efficient two-dimensional framework that is simple and easy to implement for different data structures. We demonstrated this by deriving two-dimensional lock-free designs for stacks, FIFO queues, deques and shared counters.
Our experimental results have shown that relaxing in one dimension restricts the capability to control relaxation behaviour in terms of throughput and accuracy. Compared to previous solutions, our framework can be used to extend existing data structures with minimal modifications while achieving better performance in terms of throughput and accuracy.
