2D-Stack: A scalable lock-free stack design that continuously relaxes semantics for better performance by Rukundo, Adones et al.
Technical Report no. 2018:06
2D-Stack:
A scalable lock-free stack design that continuously relaxes
semantics for better performance
Adones Rukundo, Aras Atalar, Philippas Tsigas
Email: {adones, aaras, tsigas}@chalmers.se
Distributed Computing and Systems
Department of Computer Science and Engineering
Chalmers University of Technology and Göteborg University
SE-412 96 Göteborg, Sweden
Göteborg, 2018
2D-Stack Technical Report no. 2018:06, ISSN: 1652-926X
Technical Report in Computer Science and Engineering at
Chalmers University of Technology and Göteborg University
Abstract
In this report, we propose an efficient lock-free concurrent stack design with tunable and
tenable relaxed semantics that allow for better performance. The design is materialized by a
shared-memory distributed stack that allows for a continuous, monotonic trade of weaker
semantics for better throughput. Concurrent stacks have an inherent scalability
bottleneck due to their single access point for both push and pop operations.
Elimination and semantic relaxation have been proposed in the literature to address this
problem. Semantic relaxation has the potential and flexibility to monotonically reach very
high throughput, but previous solutions could not fully leverage this potential. We propose a new
two-dimensional design that achieves this by exploiting disjoint access parallelism in one
dimension and locality in the other. The stack is distributed in the form
of sub-stacks that are accessed independently in parallel, and load balancing keeps a
balanced number of operations on the individual sub-stacks.
We also provide tight relaxation bounds for the behaviour of our algorithm. We compare it
experimentally to previous work, with respect to throughput and observed relaxed behaviour, under
different relaxation and concurrency settings. The results show that our algorithm significantly
outperforms all other algorithms in terms of performance, while maintaining better quality than
other designs with relaxed semantics.
Keywords: Stack, Lock-free, Relaxed Semantics, Concurrency, Data Structures, Weak Consistency,
Distributed Algorithms.
1 Introduction
Stacks entered the computer science literature in 1946, when Alan M. Turing used the terms "bury"
and "unbury" as a means of calling and returning from subroutines. Stacks are linear data structures
or, more abstractly, sequential collections of ordered items with two principal operations: Push, which
adds an item, and Pop, which removes an item. The Push and Pop operations occur at a single
point of the structure, referred to as the top or head of the stack. Stack operations follow
Last In First Out (LIFO) semantics. Concurrent stacks, like any other concurrent data structure,
require synchronization to guarantee behavior that is legal with respect to their exact sequential
specification. However, synchronization may incur a high performance overhead as the
number of threads increases. In the case of a stack, the overhead can be attributed to the joint access point at
the stack top, which leads to contention and a scalability bottleneck. Synchronization is vital to achieving
correctness and cannot be eliminated [5]; at the same time, synchronization and scalability conflict
in the form of contention. To reduce contention and improve scalability, synchronization points
need to be distributed by creating disjoint access points. Disjoint access techniques for stacks, such as
elimination [1, 12, 17], combining [18] and dynamic elimination-combining [7], have been proposed
in the literature. Such techniques allow threads to complete their operations without necessarily
accessing the top of the stack. However, these techniques depend on the presence of other concurrent
operations, which introduces a waiting time between dependent operations: for example,
a Push waiting for a Pop to eliminate with, or a Push waiting for another Push to combine with.
To further improve the scalability of concurrent data structures, recent research has
focused on expanding the set of legal behaviours, including weakening consistency and semantic
relaxation, to provide trade-offs between scalability and linearizability guarantees. The computability
of relaxed data-structures [16], together with relaxed semantics definitions including k-Out-of-Order,
k-Lateness and k-Stuttering, have been proposed in the literature as interesting relaxation
models to consider [13, 20].
Distributing parts of the data-structure, and hence access to it [11, 15, 21], has emerged as a frequent
technique for implementing relaxation. A given data-structure is split into multiple sub-structures
with independent access points to improve disjoint access parallelism. Operations are distributed
over the sub-structures using different scheduling techniques: thread binding [21], random access
[15], load-balancing [11], round robin, and combinations of these. Various relaxed data-structures
have been proposed in the literature; most use one-dimensional relaxation, exploiting either disjoint access
parallelism or locality. Disjoint access is achieved by creating extra access points, whereas
locality is achieved by controlling the number of threads that share a given access point.
In this report, we aim to leverage semantic relaxation by exploiting both disjoint access parallelism
and locality. Locality can be obtained by letting a thread work on the same access point for
some time in isolation. Previous works have used thread to sub-structure binding to exploit locality.
However, to maintain LIFO accuracy bounds and differ from pool semantics [19, 10, 1], a mechanism
that synchronizes the thread-local work or limits the amount of local work must be introduced.
This might turn out to be the performance bottleneck as one increases the degree of concurrency
[10]. We introduce an algorithm (2D-stack) that enables disjoint access parallelism and exploits
locality within strict deterministic accuracy bounds in an efficient way, avoiding expensive work-sharing
mechanisms. This not only increases the performance for a given configuration but also
gives one the capability to monotonically trade accuracy for better performance. This is achieved
through a distributed stack composed of multiple sub-stacks. Each sub-stack is independently
accessed through a pointer to the topmost item of the given sub-stack. We also implement
three other distributed stack designs based on known basic scheduling techniques: Random
Single Choice (Random), Random Choice of Two (Random-c2) and Round-Robin (k-robin). This
gives us a detailed comparison of stack semantic relaxation, since such designs have not been
proposed with relaxed semantics before. Our 2D-stack design exhibits a two-dimensional scheduling
technique (load balancing), exploiting disjoint access parallelism in one dimension and locality in
the other. 2D-stack significantly outperforms previous stack implementations, including the extra
three implemented stack designs, as observed in the experimental evaluation (Section 7).
The report is structured as follows. In Section 2 we discuss literature related to this work. We
present the implementations in Section 3 and prove correctness in Section 4. A step complexity
analysis is discussed in Section 5. We select optimization parameters in Section 6 and use them for
experimental evaluation presented in Section 7. We then conclude in Section 8.
2 Related Work
Concurrent stacks are inherently sequential due to their single point access bottleneck. In the
quest to improve scalability, disjoint access strategies have been proposed for designing
concurrent stacks, including elimination trees [1, 17], combining funnels [18] and elimination back-off
[7, 12]. Elimination back-off implements a collision array in which pop operations try to collide and
cancel with concurrent push operations to reduce joint access to the top of the stack. Such operation
pairs create disjoint collisions that are executed in parallel with operations accessing the main stack
implementation. As an extended back-off strategy, it reduces joint access to the main stack by
canceling out paired operations and completing their execution on the collision array. However,
the performance benefits flatten out quickly once the number of threads exceeds a certain threshold.
Elimination back-off mostly benefits symmetric workloads, in which the numbers of push and pop
operations are roughly equal; its performance deteriorates when workloads are asymmetric.
Recently, semantic relaxation has been proposed for data-structures to provide trade-offs between
scalability and linearizability guarantees. Relaxation techniques introduce an acceptable error
within the legal strict semantics of a given data-structure; e.g., the pop operation of a relaxed stack
is allowed to return any of the k topmost items of the stack. To quantify this error, relaxed semantic
definitions including k-Out-of-Order, k-Stuttering and k-Lateness have been introduced [13, 20].
Based on these definitions, new designs have been proposed for some fundamental data structures to
introduce relaxation. A k-Out-of-Order stack has been proposed in [13, 2], referred to as Segmented
(k-segment) henceforth. It is composed of a linked list of memory segments whose size is defined by
k indexes. The stack items can only be accessed through the topmost segment, where an
operation pushes or pops an item at any of the k indexes. A Push
operation tries to push an item onto an empty slot in the top segment, adding a new segment if the
top segment is full. A Pop operation tries to remove an item from the top segment, removing the segment
if it is empty and is not the only segment on the stack. Operations perform a linear search for an empty
(Push) or filled (Pop) index, starting at a random index in the topmost segment. Relaxation
is controlled through the width dimension, with the segment size growing as k increases.
Increasing k improves disjoint access and reduces contention up to a given threshold:
as k grows, at some point the gains from reduced contention diminish,
whereas the cost of traversing memory in search of an index increases. This is coupled with an accuracy
loss proportional to the increase in k. Also, for small k, there is a high cost of operation
synchronization when trying to remove or add a segment. These performance characteristics
limit the scaling and the performance gains of the algorithm as relaxation increases.
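To make the k-Out-of-Order definition concrete, the following minimal sketch checks a sequential history against it: every Pop must remove one of the k topmost items. The function name `is_k_out_of_order` and the tuple encoding of the history are hypothetical choices for illustration and do not come from the cited works; items are assumed to be unique.

```python
def is_k_out_of_order(history, k):
    """Return True if every pop removes one of the k topmost items.

    history: list of ("push", item) or ("pop", item) tuples
             (hypothetical encoding; items are assumed unique).
    """
    stack = []  # the last element is the top of the stack
    for op, item in history:
        if op == "push":
            stack.append(item)
        else:
            topmost = stack[-k:]  # the k topmost items (fewer if the stack is short)
            if item not in topmost:
                return False
            stack.remove(item)
    return True


history = [("push", 1), ("push", 2), ("push", 3),
           ("pop", 2), ("pop", 3), ("pop", 1)]
print(is_k_out_of_order(history, 1), is_k_out_of_order(history, 2))
```

With k = 1 the check degenerates to strict LIFO, so the history above is rejected; with k = 2 it is accepted, because item 2 is among the two topmost items when it is popped.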
Other relaxed data-structures that have been proposed include priority queues [4, 15, 21] and shared memory
distributed FIFO queues [11].
3 Relaxed Stack Design Description
In this section, we describe our two-dimensional 2D-Stack design plus three other related distributed
stack designs that we implement for the sake of a detailed study. All follow the same sub-stack
design with different thread scheduling techniques. An array of pointers (stack-array) is used to
access individual sub-stacks. Each sub-stack is independently accessed through its topmost item
pointer at a given stack-array index. Operations on each sub-stack follow the lock-free Treiber stack
design [8].
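As a rough illustration of a single sub-stack, the sketch below follows the retry-loop shape of a Treiber stack. Python has no hardware compare-and-swap, so the `AtomicRef` class is a hypothetical stand-in in which a lock merely emulates the atomicity of one CAS; the class and method names are assumptions, not the report's implementation.

```python
import threading


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class AtomicRef:
    """Stand-in for a CAS-capable word; the lock only emulates atomicity."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class TreiberStack:
    def __init__(self):
        self.top = AtomicRef(None)

    def push(self, value):
        while True:  # retry until the CAS on the top pointer succeeds
            old_top = self.top.load()
            node = Node(value, old_top)
            if self.top.compare_and_swap(old_top, node):
                return

    def pop(self):
        while True:
            old_top = self.top.load()
            if old_top is None:
                return None  # empty sub-stack
            if self.top.compare_and_swap(old_top, old_top.next):
                return old_top.value
```

In the distributed designs discussed here, the stack-array would hold one such top pointer per sub-stack, and a failed CAS is the signal of contention on that sub-stack.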
The three extra scheduling techniques lead to different stack algorithms that reveal different
performance and accuracy characteristics. Random and Random-c2 are simple randomized stack
algorithms that use the width dimension to distribute operations by randomly selecting a
sub-stack to operate on. Random-c2 increases accuracy by employing the random choice of two
technique. Our third algorithm, k-robin, uses the width dimension to distribute operations following
a strict round robin sub-stack selection.
3.1 2D-Stack
The 2D-Stack algorithm uses three parameters to tune its performance: width, depth and shift, which
form a count-based bidirectional operational region (window) in which an operation can occur.
The number of sub-stacks is defined by width, whereas depth defines the maximum number of items
acceptable for a single sub-stack per window. We implement a global counter (Global) that limits
the depth by defining the maximum and minimum number of items per sub-stack for a given active
window. The window and Global give us the liberty to optimize for both accuracy and throughput
within tightly defined accuracy bounds. A given thread randomly selects a sub-stack to operate
on and tries to stay on the same sub-stack for as long as there is no contention and the
Global restrictions are not violated. If contention is detected on a given sub-stack, the thread randomly selects
another sub-stack to reduce contention. This allows threads to optimistically exploit locality
without thread to sub-stack binding.
The algorithm operations are depicted in Algorithm 1. 2D-stack is a shared-memory distributed
stack composed of multiple lock-free sub-stacks. An individual sub-stack is implemented using
a linked list whose operations follow the Treiber stack design [8]. Each sub-stack has a unique
descriptor (lines 1 to 4) that keeps track of the sub-stack information, including a pointer to the
topmost item and an item-counter. A descriptor has a dedicated memory location accessed through an
array (stack-array). Using a CAE instruction (see footnote 1), we can update the descriptor contents in one atomic
step to maintain correctness (lines 15 and 27 for Push and Pop respectively).
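The descriptor update can be pictured with the following minimal sketch. It assumes an immutable descriptor with the three fields named in the text (topmost item, item-count, version) and emulates the 16-byte CAE with a lock; `DescriptorSlot`, `cae` and `try_push` are hypothetical names, and the `(value, next)` tuple is only a stand-in for a linked-list node.

```python
import threading
from typing import NamedTuple, Optional


class Descriptor(NamedTuple):
    item: Optional[tuple]  # topmost node, None when the sub-stack is empty
    count: int             # number of items in the sub-stack
    version: int           # bumped on every successful update


class DescriptorSlot:
    """One stack-array entry; the lock only emulates the atomic CAE."""

    def __init__(self):
        self._des = Descriptor(None, 0, 0)
        self._lock = threading.Lock()

    def read(self):
        return self._des

    def cae(self, expected, new):
        with self._lock:
            if self._des == expected:
                self._des = new
                return True
            return False


def try_push(slot, value):
    """One Push attempt: build the new descriptor, then swap it in atomically."""
    des = slot.read()
    node = (value, des.item)  # (value, next) pair standing in for a list node
    new_des = Descriptor(node, des.count + 1, des.version + 1)
    return slot.cae(des, new_des)
```

Because the pointer, the count and the version change together, an attempt that started from a stale descriptor fails its CAE instead of corrupting the count.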
To perform an operation, a thread searches for a sub-stack based on the Global (GetIndex). A
thread selects a sub-stack, then compares the sub-stack item-count with the Global (line 45 or 47).
The thread can proceed on the selected sub-stack only if the comparison evaluates to true (line
46 or 48). Otherwise, the thread has to search for another sub-stack. For each operation, the thread
starts from the sub-stack on which it last succeeded (line 44). The thread first tries
a given number of random hops (line 50), then switches to round robin until a valid sub-stack is
found or, after failing on all sub-stacks, the thread updates the Global (line 64 or 68).
The Global is updated in relation to depth. If the thread detects contention on a sub-stack, a
random hop to another sub-stack is performed (line 18 or 30). This reduces possible contention
on consecutive sub-stacks that might arise from round robin hops. It also introduces our concept of
optimistic locality: a thread can operate on a given sub-stack for as long as no contention is detected. A
failed CAE signals the presence of another thread operating on the same sub-stack. To avoid further
contention, the thread that has failed leaves the sub-stack to the successful thread.
During the search, the thread validates each sub-stack item-count against the Global. The item-count
must be less than Global for Push, or greater than the difference between Global and depth
for Pop (line 45 or 47). If the item-count is zero, then the sub-stack is empty. If no valid sub-stack
is found, the Global is updated atomically (line 64 or 68): Push adds, whereas Pop subtracts, a value
(shift) (line 62 or 66); shift must be less than or equal to depth. The search is then restarted with
a fresh search count. If a valid sub-stack is found, the thread tries to operate on it. On success the
sub-stack descriptor is updated (line 15 or 27); otherwise another sub-stack is searched for, starting
from a random index (line 18 or 30).
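The search and window-shift logic described above can be sketched for a single thread as follows. This is a sequential model under several stated assumptions, not the lock-free algorithm: sub-stacks are plain Python lists, there is no CAE and no random hop, Global is assumed to start at depth, Pop uses the strict comparison given in the text, and a Pop keeps searching past empty sub-stacks and reports empty only when the window is at the bottom. The class name `Sequential2DStack` is hypothetical.

```python
class Sequential2DStack:
    """Single-threaded model of the window rule (illustrative only)."""

    def __init__(self, width, depth, shift):
        assert width >= 1 and 1 <= shift <= depth
        self.width, self.depth, self.shift = width, depth, shift
        self.subs = [[] for _ in range(width)]
        self.glob = depth  # assumed initial Global: upper bound of the first window
        self.index = 0     # sub-stack on which the last operation succeeded

    def _search(self, valid):
        # Start from the last successful sub-stack, then sweep round robin.
        for hop in range(self.width):
            i = (self.index + hop) % self.width
            if valid(len(self.subs[i])):
                self.index = i
                return i
        return None

    def push(self, item):
        while True:
            i = self._search(lambda n: n < self.glob)
            if i is not None:
                self.subs[i].append(item)
                return
            self.glob += self.shift  # every sub-stack is full: shift the window up

    def pop(self):
        while True:
            i = self._search(lambda n: n > self.glob - self.depth)
            if i is not None:
                return self.subs[i].pop()
            if self.glob - self.depth <= 0:
                return None          # window at the bottom and nothing to pop
            self.glob -= self.shift  # shift the window down


stack = Sequential2DStack(width=2, depth=2, shift=1)
for value in range(6):
    stack.push(value)
print(stack.subs, stack.glob)
```

In this model every sub-stack size stays within [Global - depth, Global] after each operation, which is the sequential-process interval used later in the complexity analysis.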
1 Compare and Exchange (CAE) atomically compares 16 bytes of memory content and exchanges it with new
content on success.
A successful Push increments, whereas a successful Pop decrements, the item-counter by one. The
topmost item pointer is also updated. At this point, a Push adds an item, whereas a Pop returns an item
for a non-empty sub-stack or NULL for an empty one. An empty sub-stack is represented by a NULL item
pointer within the descriptor. As an optimization, the thread keeps track of the Global at
every hop during the search process, restarting whenever a Global change is detected. Keeping track of
the Global prevents the thread from searching uselessly with stale Global information. Consider
a Push thread that reads Global just before it is incremented by a preceding Push thread. Without this check, the
succeeding thread would have to check all sub-stacks in search of a non-full sub-stack before proceeding
to access the updated Global. Note that keeping track of the Global has no added cost apart from




3     count; unsigned long
4     version; unsigned long
5  Struct Global
6     count; unsigned long
7     version; unsigned long
8  Function Push(NewItem)
9     while true do
10       {Des,Index} = GetIndex(push,Index);
11       NewItem.next = Des.item;
12       NDes.item = NewItem;
13       NDes.count = Des.count + 1;
14       NDes.version = Des.version + 1;







22    while true do
23       {Des,Index} = GetIndex(pop,Index);
24       if Des.item != NULL then
25          NDes.item = Des.item.next;
26          NDes.count = Des.count - 1;









36 ▷ Des stands for descriptor;
37 ▷ Glo stands for Global;
38 Function GetIndex(Op,Index)
39    IndexSearch = 0; PGlo = Glo; Random = 0;
40    while true do
41       if Index ≥ ArraySize then
42          Index = 0;
43       end
44       Des = Array[Index]; ▷ Read descriptor
45       if Op == push ∧ Des.count < Glo.count then
46          return {Des,Index};
47       else if Op == pop ∧ (Des.count ≥ (Glo.count - depth) ∨ Des.count == 0) then
48          return {Des,Index};
49       else if PGlo == Glo then




54             Index += 1;
55          end
56       else
57          PGlo = Glo; IndexSearch = 0;
58       end
59       if IndexSearch == ArraySize then
60          IndexSearch = 0;
61          if Op == push ∧ PGlo == Glo then
62             NGlo.count = PGlo.count + ShiftUp;
63             NGlo.version = PGlo.version + 1;
64             CAE(Glo,PGlo,NGlo);
65          else if Op == pop ∧ PGlo == Glo ∧ (Glo.count - depth) > 0 then
66             NGlo.count = PGlo.count - ShiftDown;
67             NGlo.version = PGlo.version + 1;
68             CAE(Glo,PGlo,NGlo);
69          end
70          PGlo = Glo;
71       else if Random ≥ 2 then
72          IndexSearch += 1;
73       end
74    end
3.2 Other Shared Memory Distributed Stack Designs
In this section, we present other distributed stack designs based on known basic scheduling
techniques. They are briefly described to give the reader an implementation overview for a better
performance comparison with the 2D-stack.
3.2.1 Round-Robin
This algorithm uses two parameters to tune its performance: the number of threads and the number of
sub-stacks. Unlike the Random and Random-c2 algorithms, k-robin provides a deterministic accuracy
bound and is linearizable with respect to k-Out-of-Order stack semantics. The algorithm distributes operations
in a strict round robin fashion without skipping a sub-stack. Each thread has two local
independent counters, a Pop operation counter and a Push operation counter. A thread tries to operate
on the sub-stack indicated by the respective operation counter; if successful, it increments that
counter for the next operation. Otherwise, it keeps trying on the same sub-stack until it succeeds.
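The per-thread bookkeeping of k-robin can be sketched as below. This is a sequential illustration only: the sub-stacks are plain lists, the CAS retry loop is reduced to a comment, and the class name `RoundRobinThread` is hypothetical. Advancing the Pop counter after a Pop that finds its sub-stack empty is an assumption, since the text does not spell that case out.

```python
class RoundRobinThread:
    """Per-thread view of k-robin: two independent round-robin counters."""

    def __init__(self, sub_stacks):
        self.subs = sub_stacks  # shared list of sub-stacks (plain lists here)
        self.push_counter = 0
        self.pop_counter = 0

    def push(self, item):
        i = self.push_counter % len(self.subs)
        # Concurrent version: retry the CAS on this same sub-stack until it succeeds.
        self.subs[i].append(item)
        self.push_counter += 1

    def pop(self):
        i = self.pop_counter % len(self.subs)
        item = self.subs[i].pop() if self.subs[i] else None
        self.pop_counter += 1  # assumption: an empty result also advances the counter
        return item


subs = [[], [], []]
thread = RoundRobinThread(subs)
for value in range(6):
    thread.push(value)
print(subs)
```

Because the two counters are independent, a thread's pushes and pops each sweep the sub-stacks in order, which is what the bound in Section 4.2 relies on.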
3.2.2 Random Single Choice
This is the most basic algorithm built on top of the sub-stack design. It takes a single parameter:
the number of sub-stacks (width). For both Pop and Push operations, a thread selects a sub-stack uniformly
at random to perform its operation. Once the sub-stack is selected, the respective operation
follows the Treiber stack design.
3.2.3 Random Choice of Two
To improve accuracy, Random Single Choice is extended to Random-c2. The Random-c2
design is based on the power of two random choices principle [14], and is similar to MultiQueues
[6] and Power of choice of two [3]. As in Random, the number of sub-stacks remains the only
parameter to tune. The algorithm is depicted in Algorithm 2. Each pushed element is
tagged with a time-stamp generated using a globally consistent clock (line 5). Time-stamps provide
a logical global ordering of elements. A Pop operation randomly selects two sub-stacks and
proceeds to pop from the sub-stack whose element has the highest time-stamp (line 25). A Push
operation randomly selects a sub-stack and proceeds to push the element onto it (line 4).
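A sequential sketch of the Random-c2 selection rule is given next. It rests on a few assumptions that are not taken from the report: an `itertools.count` counter stands in for the globally consistent clock, an empty sub-stack is treated as having time-stamp 0, and the second index is drawn directly from the remaining sub-stacks instead of retrying until it differs from the first. The class name `RandomC2Stack` is hypothetical.

```python
import itertools
import random


class RandomC2Stack:
    """Sequential model of Random Choice of Two (illustrative only)."""

    def __init__(self, width, seed=None):
        self.subs = [[] for _ in range(width)]  # entries are (timestamp, item)
        self.clock = itertools.count(1)          # stand-in for a global clock
        self.rng = random.Random(seed)

    def push(self, item):
        i = self.rng.randrange(len(self.subs))
        self.subs[i].append((next(self.clock), item))

    def _top_stamp(self, i):
        # An empty sub-stack ranks lowest (assumed time-stamp 0).
        return self.subs[i][-1][0] if self.subs[i] else 0

    def pop(self):
        width = len(self.subs)
        i = self.rng.randrange(width)
        if width > 1:
            j = self.rng.randrange(width - 1)
            if j >= i:
                j += 1  # second index, distinct from the first
            if self._top_stamp(j) > self._top_stamp(i):
                i = j   # prefer the sub-stack with the newer top element
        return self.subs[i].pop()[1] if self.subs[i] else None
```

With width = 2 both sub-stacks are always compared, so this model behaves as a strict LIFO stack; with more sub-stacks a Pop may miss the newest element, which is why no deterministic k is claimed for this design.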
Algorithm 2: Random Choice of Two
1  Function PushItem(*NewItem)
2     *Item;
3     while true do
4        index = RandomIndex(); Item = StackArray[index].item;
5        NewItem→next = Item; NewItem→tag = Timestamp;




10 Function PopItem()
11    *Item; *NewItem; tag1 = 0; tag2 = 0;
12    while true do
13       index1 = RandomIndex();
14       while StackArraySize > 1 do
15          index2 = RandomIndex();
16          if index1 != index2 then
17             tag1 = Item1→tag; tag2 = Item2→tag;
18             if tag1 > tag2 then
19                index = index1; break;
20             else




25       Item = StackArray[index].item;
26       if Item != NULL then
27          NewItem = Item→next;








4 Correctness
In this section, we prove the correctness of our algorithms. We examine and prove linearizability
and lock-freedom for both k-robin and 2D-stack. We do not consider Random and Random-c2 in
this section, since k would be unbounded (proportional to the maximum number of items in the
stack) for these two algorithms.
To begin with, we introduce the linearization points of Push and Pop operations for both stack
designs; they are the same points (same program lines) that the original Treiber
Stack implementation has as linearization points. For 2D-stack, Pop linearizes either by returning
NULL (at line 33) or with a successful CAE at line 27. Push linearizes with a successful CAE at
line 15. For k-robin, Pop linearizes either by returning NULL or with a successful CAS that pops
an item by modifying the top pointer of a sub-stack. Push linearizes with a successful CAS that
modifies the top pointer of a sub-stack.
We prove the linearizability of 2D-stack and k-robin with respect to the sequential semantics of the
k-Out-of-Order stack [13], which provides a relaxed version of the LIFO semantics. Relaxation can
be applied method-wise, and in the k-Out-of-Order stack it is applied only to Pop operations, i.e. a Pop
pops one of the topmost k items. Push operations add the item to the top of the stack.
4.1 2D-Stack
Firstly, we require some notation. The window defines the active region in which operations are
allowed to proceed (lines 45 and 47 for Push and Pop respectively). The window is shifted by the
parameter shift, $1 \le shift < depth$. A window $i$ ($W_i$) has an upper bound ($W_i^{up}$) and a lower
bound ($W_i^{down}$), defined by $W_i^{up} = depth + (i \times shift)$ and $W_i^{down} = i \times shift$, respectively.
A window is active iff $W_i^{up} = Global$. The width parameter describes the number of sub-stacks.
The number of items in sub-stack $j$ is denoted by $N_j$, $1 \le j \le width$. To recall, the top pointer,
the version number and $N_j$ are embedded into the descriptor of sub-stack $j$, and all can be modified
atomically with a CAE.
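As a small worked example of the window notation, take the illustrative values depth = 4 and shift = 2 (chosen here for illustration, not taken from the report). The first three windows are then:

```latex
% Illustrative values only: depth = 4, shift = 2
\begin{align*}
W_0^{down} &= 0 \times 2 = 0, & W_0^{up} &= 4 + 0 \times 2 = 4,\\
W_1^{down} &= 1 \times 2 = 2, & W_1^{up} &= 4 + 1 \times 2 = 6,\\
W_2^{down} &= 2 \times 2 = 4, & W_2^{up} &= 4 + 2 \times 2 = 8.
\end{align*}
```

Each window spans depth counts, and with shift smaller than depth, consecutive windows overlap by depth - shift = 2 counts.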
Lemma 1 Given that $Global = depth + shift \times i$, it is impossible to observe a state ($S$) such that
$N_j > W_{i+1}^{up}$ (or $N_j < W_{i-1}^{down}$), where $1 \le j \le width$.
Proof: Recall that $Global = W_i^{up}$ defines the active window in which operations are allowed to
start, although they might linearize while the active window is set to an adjacent window ($Global =
W_{i-1}$ or $Global = W_{i+1}$). Such a state cannot be observed at initialization; therefore, there should
exist a point in time at which this state ($S$) is observed for the first time, with $Global = depth + shift \times i$
and $N_j > W_{i+1}^{up}$ (or $N_j < W_{i-1}^{down}$; we do not consider this symmetric case in the proof, since it
can be covered with the same arguments as $N_j > W_{i+1}^{up}$). Now, we show that this is impossible by
considering the interleaving of operations.
Without loss of generality, assume thread 1 ($T_1$) has set $Global = depth + shift \times i$ with the CAE
(at line 64 or 68) at time $t_1^l$. To do this, $T_1$ should have observed either $Global = depth + shift \times (i-1)$
and then $N_j = W_{i-1}^{up}$, or $Global = depth + shift \times (i+1)$ and then $N_j = W_{i+1}^{down}$. Let this observation
of Global (at line 39) happen at time $t_1$. Consider the last successful push operation at sub-stack
$j$ before the state $S$ is observed for the first time (we do not consider Pop operations, as they can
only decrease $N_j$ to a value that is less than $W_{i+1}^{up}$; this case will be covered by the first item below).
Assume thread 0 ($T_0$) sets $N_j$ to $N_j > W_{i+1}^{up}$ in this push operation. In this operation, $T_0$ should
observe $N_j \ge W_{i+1}^{up}$ (at line 44) and $Global > W_{i+1}^{up}$ (at line 45). Let line 45 be executed (atomic
read) at time $t_0$, and let the linearization of the operation happen at $t_0^l > t_0$ with the CAE (at line 15).
• If $t_0^l < t_1$, the concerned state ($S$) cannot be observed, since Global cannot be changed (to
$depth + shift \times i$) after $N_j > W_{i+1}^{up}$ is observed.
• Else if $t_1^l < t_0$, the concerned state ($S$) cannot be observed, since the push operation cannot
proceed after observing Global (at line 45) with such $N_j$.
• Else if $t_1 > t_0$, then $T_0$ cannot succeed in the CAE (at line 15), because this implies that $N_j$
has been modified since $T_0$ read the descriptor (at line 44); the difference between the value of Global
observed by $T_0$ and then by $T_1$ implies this. At least the version number
would have changed since then, thus leading to a failed CAE (at line 15).
• Else if $t_1 < t_0$, then this implies that Global has been modified since it was read by $T_1$ (at line 39);
thus the CAE (at line 64 or 68) would fail, at least based on the version number.
Hence, the lemma.
Lemma 2 At all times, there exists an $i$ such that $\forall j$, $1 \le j \le width$: $W_i^{down} \le N_j \le W_{i+1}^{up}$.
Proof: Informally, the lemma states that the sizes of the sub-stacks span at most two consecutive
windows. Assume that the statement is not true; then there should exist a pair of sub-stacks ($k$
and $j$) at some point in time such that $\exists i$, $N_j > W_{i+1}^{up}$ and $N_k < W_i^{down}$. One cannot observe such
$N_j$ and $N_k$ at initialization. Then, there should exist a time $t$ at which this is observed for the first
time. Consider the last push operation at sub-stack $j$ and the last pop operation at sub-stack $k$ that
linearize before or at time $t$.
Assume thread 0 ($T_0$) sets $N_j$ and thread 1 ($T_1$) sets $N_k$. To do this, $T_0$ should observe $N_j \ge W_{i+1}^{up}$
(at line 44) and $Global > W_{i+1}^{up}$ (at line 45); let line 45 (atomic read of Global) be executed at $t_0$, and let
the linearization of the Push operation occur at $t_0^l > t_0$ with the CAE (at line 15). Similarly, for the
Pop operation of $T_1$, let line 47 be executed at $t_1$; the observed value should be $Global \le W_i^{down}$.
Let the Pop operation linearize at time $t_1^l > t_1$ with the CAE (at line 27). Now, we consider
the possible interleavings.
• If $t_0^l < t_1$ (or the symmetric $t_1^l < t_0$, for which we do not repeat the arguments), then for $T_1$
to proceed and pop an item from sub-stack $k$, it is required that $Global \le W_i^{down}$ (at line 47).
Based on Lemma 1, this is impossible when $N_j > W_i^{up}$.
• Else if $t_1 > t_0$, then $T_0$ cannot succeed in the CAE (at line 15), because this implies that $N_j$
has been modified since $T_0$ read the descriptor (at line 44); the difference between the value of Global
observed by $T_0$ and then by $T_1$ implies this. At least the version number
would have changed since then, thus leading to a failed CAE (at line 15).
• Else if $t_0 > t_1$, the argument above holds for $T_1$ too, so $T_1$ should fail at the CAE (at line 27).
Such a pair $N_j$ and $N_k$ cannot co-exist at any time; hence, the lemma.
Theorem 3 The 2D-stack algorithm is linearizable with respect to k-Out-of-Order stack semantics, where
$k = (2\,shift + depth)(width - 1)$.
Proof: Consider the linearization points of the Push and Pop operations that insert and
remove the item $e$ into and from a sub-stack (let it be sub-stack $j$). Let $t_e^{push}$ and $t_e^{pop}$ denote these points
respectively, $t_e^{pop} > t_e^{push}$. Now, we bound the maximum number of items that are pushed after
$t_e^{push}$ and are not popped before $t_e^{pop}$, to obtain $k$. Let $N_j$ become $x$ with the linearization of the
push operation that inserts item $e$. In other words, item $e$ is the $x$th item from the bottom of the
sub-stack. Consider a window $i$ such that $W_i^{down} \le x \le W_i^{up}$.
Lemma 2 states that the sizes of the sub-stacks should reside in a bounded region. Relying
on Lemma 2, we can deduce that at time $t_e^{push}$ the following holds: $\forall i: N_i \ge W_i^{down} - shift$.
Similarly, we can deduce that at time $t_e^{pop}$ the following holds: $\forall i: N_i \le W_i^{up} + shift$. Therefore,
the maximum number of items that are pushed to sub-stack $i$ ($i \ne j$) after $t_e^{push}$ and are not popped
before $t_e^{pop}$ is at most $W_i^{up} + shift - (W_i^{down} - shift) = depth + 2\,shift$. We know that this number
is zero for sub-stack $j$ (the sub-stack into which $e$ is inserted) and we have $width - 1$ other sub-stacks.
So, there can be at most $(depth + 2\,shift)(width - 1)$ items that are pushed after $t_e^{push}$ and are not
popped before $t_e^{pop}$. Hence, the theorem.
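The bound stated in Theorem 3 is easy to tabulate when choosing parameters. The helper below simply evaluates the formula; the function name `relaxation_bound_2d` and the sample parameter values are hypothetical, chosen only for illustration.

```python
def relaxation_bound_2d(width, depth, shift):
    """Evaluate k = (2*shift + depth) * (width - 1), the bound stated in Theorem 3."""
    if width < 1 or depth < 1 or shift < 1:
        raise ValueError("width, depth and shift must be positive")
    return (2 * shift + depth) * (width - 1)


# The bound grows with each parameter, so every one of them trades accuracy for parallelism.
for width in (1, 2, 4, 8):
    print(width, relaxation_bound_2d(width, depth=8, shift=4))
```

For a single sub-stack (width = 1) the formula gives k = 0, matching the observation in the proof that no later item can remain above e in its own sub-stack.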
4.2 Round-Robin
Theorem 4 k-robin is linearizable with respect to k-Out-of-Order stack semantics, where
$k = (2P - 1)(width - 1)$. $P$ stands for the total number of concurrent threads and $width$ denotes the number
of sub-stacks.
Proof: Consider the linearization points of the Push and Pop operations that respectively insert and
remove the item $e$ into and from a sub-stack (let it be sub-stack 0). Let $t_e^{push}$ and $t_e^{pop}$ denote these points,
respectively. Now, we bound the maximum number of items that are pushed after $t_e^{push}$ and are
not popped before $t_e^{pop}$, to obtain $k$. We denote the number of items that are pushed to (popped from)
sub-stack $i$ by thread $j$ in the time interval $[t_e^{push}, t_e^{pop}]$ by $push_i^j$ ($pop_i^j$).





Observe that each thread applies its operations in round robin fashion without skipping any
index. If the previous successful Pop occurred at sub-stack $i$, the next Pop occurs at sub-stack
$i + 1 \pmod{width}$. The same applies to the push operations.
Without loss of generality, assume that thread 0 has inserted item $e$ into sub-stack 0. This implies
that $\forall i$, $width - 1 \ge i > 0$: $push_0^0 \ge push_i^0$. Now, take another thread $j$; we have $\forall i$, $width - 1 \ge
i > 0$: $push_0^j \ge push_i^j - 1$. Informally, another thread can increase the number of items on any other
sub-stack by at most one more than the number of items that it pushes onto sub-stack
0.
For the pop operations, we have a similar relation for all threads: $\forall i$, $width - 1 \ge i > 0$: $pop_0^j \le
pop_i^j + 1$. Informally, a thread can pop at most one item fewer from any other sub-stack than
the number that it pops from sub-stack 0. As the interval $[t_e^{push}, t_e^{pop}]$ starts with the push and ends








0 = Y .
Summing over all threads and sub-stacks other than sub-stack 0, we get at most $(Y + T - 1)(width - 1)$
Push operations in the interval $[t_e^{push}, t_e^{pop}]$ and at least $(Y - T)(width - 1)$ Pop operations. This leads to the
theorem: $k \le ((Y + T - 1) - (Y - T))(width - 1) = (2T - 1)(width - 1)$, where $T$ is the number of
threads (denoted $P$ in the theorem statement).
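As a quick numeric illustration of the k-robin bound, take the hypothetical values of 8 threads and 4 sub-stacks (chosen here for illustration, not taken from the report):

```latex
% Illustrative values only: P = 8 threads, width = 4 sub-stacks
\begin{align*}
k &= (2P - 1)(width - 1) \\
  &= (2 \times 8 - 1)(4 - 1) \\
  &= 15 \times 3 = 45.
\end{align*}
```

Unlike the 2D-stack bound, this k depends on the number of threads, so it cannot be held fixed while concurrency grows.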
4.3 Lock-freedom
All algorithm designs presented in this study are lock-free. For every design except the
2D-Stack, this follows directly from the properties of the lock-free Treiber Stack: an operation
can fail on its CAE only if another operation has succeeded, which ensures system progress. For the
2D-Stack, one should additionally consider whether a live-lock is possible due to the updates of
Global, which determines the active window. Global can be updated repeatedly back and forth if two
opposite operations follow each other on an empty or full active window. For example, a Pop
operation might read an empty window and update Global, leading to a full window; but before it
performs its operation, a subsequent Push reads the full window and updates Global, leading to an
empty window. This process can continue forever, leading to a system live-lock. This can, however,
be avoided by setting the shift parameter to less than depth. With this setting, a Global update
can never lead to a full or empty window unless the stack is empty. Therefore, a thread eventually
proceeds and reaches the CAE at the end of a Push or a Pop, which can only fail due to another
concurrent successful operation.
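The effect of the shift parameter can be made explicit with a short calculation. It is a sketch
for the sequential view only, assuming the window rule used in Section 5 (every sub-stack size
$N_i$ lies in $[Global - depth, Global]$) and writing $Global'$ for the value after the update; the
primed symbol is notation introduced here.

```latex
\begin{align*}
&\text{Empty window before a Pop:} && N_i = Global - depth \quad \forall i \\
&\text{The Pop shifts the window down:} && Global' = Global - shift \\
&\text{Items in the new active region:} && N_i - (Global' - depth) = shift \\
&\text{Full window before a Push:} && N_i = Global \quad \forall i \\
&\text{The Push shifts the window up:} && Global' = Global + shift \\
&\text{Items in the new active region:} && N_i - (Global' - depth) = depth - shift
\end{align*}
```

With shift = depth, the first case leaves every active region full and the second leaves every
active region empty, which is exactly the back-and-forth pattern described above. With
$1 \le shift < depth$, both results lie strictly between 0 and depth, so either type of operation
can proceed after the update.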
5 Complexity Analysis
In this section, we analyze the 2D-stack expected step complexity of a sequential process where
a single thread applies the sequence of operations. The type of an operation in the sequence is
determined with an independent coin toss with a fixed probability, where p denotes the probability
of a Push operation. With distributed access points, it is possible to make multiple hops on different
access points before finding an appropriate point to complete a given operation.
Global regulates the size of the sub-stacks. Recall that the number of sub-stacks is denoted by
width and the size of sub-stack i by $N_i$. Push and Pop operations are allowed to occur at
sub-stack i if $N_i \in [Global - depth, Global - 1]$ and $N_i \in [Global - depth + 1, Global]$,
respectively. This means that, at any time, the size of a sub-stack can only vary in the vicinity
of Global; more precisely: $\forall i$, $(Global - depth) \le N_i \le Global$. Note that this
interval is valid for the sequential process. We refer to this interval as the active region.
We introduce random variables $N_i^{active} = N_i - (Global - depth)$, $N_i^{active} \in [0, depth]$,
that provide the number of items in the active region of sub-stack i, and the random variable
$N^{active} = \sum_{i=1}^{width} N_i^{active}$, which provides the total number of items in the
window.
As mentioned before, the 2D-stack tries to exploit locality; thus, a thread starts an operation
with a query on the sub-stack where its last successful operation occurred. This means that the
thread hops iff $N_i^{active} = 0$ for a Pop or $N_i^{active} = depth$ for a Push. Therefore,
the number of sub-stacks whose active regions are full is given by
$\lfloor N^{active}/depth \rfloor$ at a given time, because the thread does not leave a sub-stack
until its active region gets either full or empty.
If the thread hops from a sub-stack, then a new sub-stack is selected uniformly at random from the
remaining set of sub-stacks. If none of the sub-stacks fulfills the condition (which implies that
$N^{active} = 0$ at a Pop or $N^{active} = depth \times width$ at a Push), then the window shifts
based on a given shift parameter (i.e., for a Push operation $Global = Global + shift_{up}$ and for
a Pop operation $Global = Global - shift_{down}$, where $1 \le shift_{down} \le depth$). One can
observe that the value of $N^{active}$ before an operation defines the expected number of hops and
the slide of the window.
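The sequential process described above can be sketched as a small model that tracks only the
sub-stack sizes. This is a hedged illustration, not the report's implementation: the class and
method names are hypothetical, a single shift value is used for both directions, the initial
window is assumed to cover sizes 0..depth, and a Pop on an empty stack is assumed to return
without shifting.

```python
import random


class SequentialWindowModel:
    """Single-threaded model of sub-stack sizes N_i regulated by Global (no real items)."""

    def __init__(self, width, depth, shift, seed=0):
        if width < 1 or not 1 <= shift <= depth:
            raise ValueError("need width >= 1 and 1 <= shift <= depth")
        self.width, self.depth, self.shift = width, depth, shift
        self.global_ = depth        # assumed start: window covers sizes 0..depth
        self.sizes = [0] * width    # N_i for every sub-stack
        self.last = 0               # sub-stack of the last successful operation
        self.rng = random.Random(seed)

    def allowed(self, i, is_push):
        """Push needs N_i in [Global-depth, Global-1]; Pop needs N_i in [Global-depth+1, Global]."""
        low = self.global_ - self.depth
        if is_push:
            return low <= self.sizes[i] <= self.global_ - 1
        return low + 1 <= self.sizes[i] <= self.global_

    def apply(self, is_push):
        """Run one operation; return (extra_hops, window_shifted), or None for a Pop on empty."""
        hops, shifted, target = 0, False, self.last
        if not self.allowed(target, is_push):
            others = [j for j in range(self.width) if j != self.last]
            self.rng.shuffle(others)            # uniform random order over the remaining set
            target = None
            for j in others:
                hops += 1
                if self.allowed(j, is_push):
                    target = j
                    break
        if target is None:                      # no sub-stack admits the operation
            if not is_push and self.global_ == self.depth:
                return None                     # assumption: Pop on an empty stack returns NULL
            if is_push:
                self.global_ += self.shift
            else:                               # assumption: clamp so the window stays above 0
                self.global_ = max(self.depth, self.global_ - self.shift)
            shifted, target = True, self.last
        self.sizes[target] += 1 if is_push else -1
        self.last = target
        return hops, shifted
```

Calling `apply(True)` repeatedly on a model with width = 2, depth = 2 and shift = 1 fills the
first sub-stack, hops once to the second, fills it, and only then shifts the window, which is the
behaviour that makes $N^{active}$ the natural state variable for the analysis that follows.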
To compute the expected step complexity of an operation that occurs at a random time, we
model the random variation process around Global with a Markov chain, where the sequence of
Push and Pop operations leads to the state transitions. As a remark, we consider the performance
of the sub-stacks mostly when they are non-empty, since Pop (NULL) and Push would have no
hops in this case. The Markov chain is strongly related to $N^{active}$. It is composed of K + 1
states $S_0, S_1, \dots, S_K$, where $K = depth \times width$. For all $i \in \{0, \dots, K\}$, the
operation is in state $S_i$ iff $N^{active} = i$. For all $(i, j) \in \{0, \dots, K\}^2$,
$P(S_i \to S_j)$ denotes the state transition probability, which is given by the following
function, where p denotes the probability of a Push:
$P(S_i \to S_{i+1}) = p$, if $0 \le i < K$
$P(S_i \to S_{i-1}) = 1 - p$, if $0 < i \le K$
$P(S_i \to S_{K - (shift \times width - 1)}) = p$, if $i = K$
$P(S_i \to S_{shift \times width - 1}) = 1 - p$, if $i = 0$
$P(S_i \to S_j) = 0$, otherwise
The stationary distribution (denoted by the vector $\pi = (\pi_i)_{i \in \{0, \dots, K\}}$) exists for
the Markov chain (letting $P(S_i \to S_i)$ take a small positive value for
$i \in \{depth, K - depth\}$, corresponding to some Pop returning NULL), since the chain is
aperiodic and irreducible. The left eigenvector of the transition matrix with eigenvalue 1 provides
the unique stationary distribution.
Lemma 5 For the Markov chain that is initialized with p = 1/2 and shift, where
$l = shift \times width - 1$, the stationary distribution is given by the vector
$\pi^l = (\pi^l_0, \pi^l_1, \dots, \pi^l_K)$, assuming $K - l \ge l$ (for $l > K - l$, one can obtain
the vector from the symmetry $\pi^l = \pi^{K-l}$):
(i) $\pi^l_i = \frac{i+1}{(l+1)(K+1-l)}$, if $i < l$;
(ii) $\pi^l_i = \frac{l+1}{(l+1)(K+1-l)}$, if $l \le i \le K - l$;
(iii) $\pi^l_i = \frac{K-i+1}{(l+1)(K+1-l)}$, if $i > K - l$.
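Proof: We have stated that the stationary distribution exists since the chain is aperiodic and
irreducible for all p and shift. Let $(M_{i,j})_{(i,j) \in \{0, \dots, K\}^2}$ denote the transition
matrix for p = 1/2 and shift. The stationary distribution vector $\pi^l$ fulfills
$\pi^l M = \pi^l$, which provides the following system of linear equations:
(i) $2\pi^l_0 = \pi^l_1$;
(ii) $2\pi^l_K = \pi^l_{K-1}$;
(iii) $2\pi^l_i = \pi^l_{i-1} + \pi^l_{i+1}$, if $0 < i < K$ and $i \notin \{l, K - l\}$;
(iv) $2\pi^l_l = \pi^l_{l-1} + \pi^l_{l+1} + \pi^l_0$;
(v) $2\pi^l_{K-l} = \pi^l_{K-l-1} + \pi^l_{K-l+1} + \pi^l_K$.
In case $l = K - l$, (iv) and (v) are replaced with
$2\pi^l_l = \pi^l_{l-1} + \pi^l_{l+1} + \pi^l_0 + \pi^l_K$.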
Based on a symmetry argument, one can observe that, for all l, $\pi^l_i = \pi^l_{K-i}$. The system
can be solved in linear time (O(K)) by assigning any positive value (for an irreducible chain,
$\pi^l_i > 0$) to $\pi^l_0$. The stationary distribution is unique; thus, for any $\pi^l_0$,
$\pi^l$ spans the solution space. We know that $\sum_{i=0}^{K} \pi^l_i = 1$; starting from
$\pi^l_0 = 1$, we obtain each item and normalize it by the sum.
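The closed form of Lemma 5 can be checked mechanically for small parameters. The sketch below
builds the transition matrix with exact rational arithmetic and tests $\pi M = \pi$. The function
names are hypothetical, and the matrix assumes that a Push moves one state up from every state
below K and a Pop moves one state down from every state above 0, with the two window-shift jumps
at the ends.

```python
from fractions import Fraction


def transition_matrix(K, l, p=Fraction(1, 2)):
    """Row-stochastic matrix on states 0..K with jumps 0 -> l (Pop) and K -> K - l (Push)."""
    M = [[Fraction(0)] * (K + 1) for _ in range(K + 1)]
    for i in range(K + 1):
        up = i + 1 if i < K else K - l      # Push; shifts the window when i == K
        down = i - 1 if i > 0 else l        # Pop; shifts the window when i == 0
        M[i][up] += p
        M[i][down] += 1 - p
    return M


def lemma5_distribution(K, l):
    """Closed form of Lemma 5, stated for K - l >= l."""
    z = (l + 1) * (K + 1 - l)
    # min(...) selects case (i), (ii) or (iii) depending on where i falls.
    return [Fraction(min(i + 1, l + 1, K - i + 1), z) for i in range(K + 1)]


def is_stationary(pi, M):
    """True iff pi sums to 1 and pi * M == pi, entry by entry."""
    n = len(pi)
    return sum(pi) == 1 and all(
        sum(pi[i] * M[i][j] for i in range(n)) == pi[j] for j in range(n)
    )


if __name__ == "__main__":
    width, depth, shift = 3, 4, 2           # hypothetical configuration
    K, l = width * depth, shift * width - 1
    print(is_stationary(lemma5_distribution(K, l), transition_matrix(K, l)))
```

Using `Fraction` avoids floating-point tolerance questions, so the check is an exact equality
rather than an approximation.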
An operation starts with the search for an available sub-stack. This search contains at least a
single query at the sub-stack where the last success occurred; we therefore define the rest of the
hops as the extra hops (denoted by Hop; there can be at most width of them). In addition, the
operation might include the slide of the window as an extra step, denoted by Glo. We denote the
number of extra steps by Extra = Hop + Glo. With the linearity of expectation, we obtain
E(Extra) = E(Hop) + E(Glo). Relying on the law of total expectation, we obtain:
(i) $E(Hop) = \sum_{i=0}^{K} \sum_{op \in \{pop, push\}} E(Hop \mid S_i, op) P(S_i, op)$;
(ii) $E(Glo) = \sum_{i=0}^{K} \sum_{op \in \{pop, push\}} E(Glo \mid S_i, op) P(S_i, op)$;
where $P(S_i, op)$ denotes the probability of an operation op occurring in state $S_i$. We analyze
the algorithm for the setting where shift = depth and p = 1/2. We do this because the bound that
we manage to find in this case is tighter and gives a better idea of the influence of the
parameters on the expected performance. For this case, the stationary distribution is given by
Lemma 5.
Theorem 6 For a 2D-Stack that is initialized with parameters depth, width, shift = depth and
p = 1/2, $E(Extra) = O\left(\frac{\ln width}{depth}\right)$.
Proof: Firstly, we consider the expected number of extra steps for a Push operation. Given that
there are $N^{active}$ items, a Push attempt would generate an extra step if it attempts to push to
a sub-stack that has $N_i^{active} = depth$ items. Recall that the thread sticks to a sub-stack
until it is not possible to conduct an operation on it. This implies that the extra steps can be
taken only in the states $S_i$ such that $i \bmod depth = 0$, because the thread does not leave a
sub-stack before $N_i^{active} = 0$ or $N_i^{active} = depth$. In addition, a Push (Pop) can only
experience an extra step if the previous operation was also a Push (Pop).
Given that we are in $S_i$ such that $i \bmod depth = 0$, the first requirement is to have a
Push as the previous operation. If this is true, then the Push operation hops to another sub-stack,
which is selected from the remaining set of sub-stacks uniformly at random. At this point, there are
$f = \frac{i}{depth} - 1$ full sub-stacks in the remaining set of sub-stacks. If a full sub-stack is
selected from this set, this leads to another hop, and again a sub-stack is selected uniformly at
random from the remaining set of sub-stacks.
Consider a full sub-stack (one of the f); this sub-stack would be hopped if it is queried before
querying the sub-stacks that are empty. There are $width - f - 1$ empty sub-stacks, thus a hop in
this sub-stack would occur with probability $1/(width - f)$. There are f such sub-stacks. With the
linearity of expectation, the expected number of hops is given by
$f/(width - f) + 1 = width/(width - f)$, which leads to
$E(Hop \mid S_i, Push) = p \times width/(width - f)$ if $i \bmod depth = 0$, and
$E(Hop \mid S_i, Push) = 0$ otherwise.
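The expected hop count width/(width - f) can be confirmed for small cases by brute force. The
sketch below enumerates every order in which the remaining width - 1 sub-stacks could be queried
and averages the number of hops until the first non-full one is reached. The function names are
hypothetical, and the enumeration assumes each query order is equally likely, which is what
uniform selection without repetition gives.

```python
from fractions import Fraction
from itertools import permutations


def expected_hops_exact(width, f):
    """Average hop count over all query orders of the width - 1 remaining sub-stacks."""
    empty = width - f - 1
    if f < 0 or empty < 1:
        raise ValueError("need f >= 0 and at least one non-full sub-stack")
    remaining = ["full"] * f + ["empty"] * empty
    total, orders = 0, 0
    for order in permutations(range(len(remaining))):
        # Hops = full sub-stacks queried first, plus the final hop onto a non-full one.
        total += next(
            n for n, idx in enumerate(order, start=1) if remaining[idx] == "empty"
        )
        orders += 1
    return Fraction(total, orders)


def expected_hops_formula(width, f):
    """Closed form f / (width - f) + 1 = width / (width - f) from the proof."""
    return Fraction(width, width - f)


if __name__ == "__main__":
    print(expected_hops_exact(5, 2), expected_hops_formula(5, 2))
```

The enumeration grows factorially, so it is only meant for widths of a handful of sub-stacks.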
















The bounds for $E(Hop \mid Push)$ also hold for $E(Hop \mid Pop)$. Given that there are K - i
empty slots in the window (the system is in state $S_i$), there are
$e = \lfloor \frac{K - i}{depth} \rfloor - 1$ sub-stacks whose window regions are empty, excluding
the sub-stack on which the thread last succeeded. Using the same arguments as above (replacing f
with e and p with 1 - p), we obtain the same bound.
The window only shifts at $S_K$ if a Push operation happens and at $S_0$ if a Pop operation
happens. Hence: $E(Glo) < \frac{2}{K+1} p + \frac{2}{K+1} (1 - p)$. Finally, using
E(Extra) = E(Hop) + E(Glo), we obtain the theorem.
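The intermediate summation for the hop term does not appear in the text above. The following is
one way the stated order can be recovered; it is a sketch, not the authors' derivation. It assumes
the stationary distribution of Lemma 5 for shift = depth, that is $l = K - 1$, which by the
symmetry $\pi^l = \pi^{K-l}$ gives $\pi_i = 1/K$ for $0 < i < K$ and $\pi_0 = \pi_K = 1/(2K)$. The
index m and the harmonic number $H_{width}$ are notation introduced here.

```latex
\begin{align*}
E(Hop \mid Push)
  &= \sum_{m=1}^{width} \pi_{m \cdot depth} \cdot \frac{p \cdot width}{width - (m - 1)}
     && \text{(states with } i = m \cdot depth,\ f = m - 1\text{)} \\
  &\le \frac{1}{depth \cdot width} \sum_{m=1}^{width} \frac{p \cdot width}{width - m + 1}
     && \text{(since } \pi_i \le 1/K = 1/(depth \cdot width)\text{)} \\
  &= \frac{p}{depth} \sum_{n=1}^{width} \frac{1}{n}
   = \frac{p}{depth} H_{width}
   = O\!\left(\frac{\ln width}{depth}\right)
\end{align*}
```

Under the same assumption, $E(Glo) = p\,\pi_K + (1 - p)\,\pi_0 = \frac{1}{2K}$, which is of lower
order, so the hop term dominates E(Extra).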
6 Parameter Selection
One of the goals of the 2D-stack is to be tunable in order to regulate the trade-off between accuracy
and performance. In this section, we investigate the impact of the 2D-stack parameters on accuracy
and performance, in order to arrive at the optimal parameter settings.
Empirically, we observed that the contention is inversely proportional to width (see Figure 1). As a
simple model, we split the latency of an operation into contention and contention-free costs:
$op = op_{cont}/width + op_{free}$. Based on this rough estimation, the performance
would increase as the contention factor vanishes with the increase of width, but with an asymptote
at $1/op_{free}$. This implies that, after some point, one cannot really gain throughput by
increasing the width, although one keeps losing accuracy. The situation is a bit more complex
because of the extra steps that the algorithm might take as width increases (see Theorem 6). To
account for them, we add a term of order $O\left(\frac{\ln width}{depth}\right)$ to the latency of
an operation. This means that, after some point, the gains from the vanishing contention factor
($op_{cont}/width \to 0$ as $width \to \infty$) might be surpassed by the extra steps, and one
would observe a decrease in throughput with the increase of width. This is counter-intuitive in
terms of the trade-off between accuracy and performance. To keep the trade-off alive, we turn our
attention to the depth parameter (depth relaxation). This parameter can be used to exploit data
locality, which might have a very significant impact on the throughput, especially in a NUMA
setting. Operations done by the same thread in isolation at the same sub-stack can yield very high
throughput.
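The rough latency model above can be explored numerically. The sketch below is illustrative
only: the constant c that stands in for the hidden factor of the O(ln width / depth) term, the
cost values, and the function names are all assumptions, not measurements from the report.

```python
import math


def modeled_latency(width, depth, op_cont, op_free, c=1.0):
    """Contention term + contention-free term + extra-step term (c is a hypothetical constant)."""
    return op_cont / width + op_free + c * math.log(width) / depth


def best_width(depth, op_cont, op_free, c=1.0, max_width=4096):
    """Width in 1..max_width with the lowest modeled latency (highest modeled throughput)."""
    return min(
        range(1, max_width + 1),
        key=lambda w: modeled_latency(w, depth, op_cont, op_free, c),
    )


if __name__ == "__main__":
    # Hypothetical costs: contention 64 units, contention-free 1 unit.
    for depth in (1, 4):
        print(depth, best_width(depth, op_cont=64.0, op_free=1.0))
```

In this model the latency is minimized near width = op_cont × depth / c, so a larger depth pushes
the point where extra steps start to dominate further out, which matches the argument for
continuing the relaxation in the depth dimension.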
In Figure 1, the red curve (L1) depicts the case where we only use the width relaxation for all K
(depth = 1). The other curves diverge from that one at some point where we start depth relaxation.
With respect to the performance, we see that it is reasonable to apply width relaxation in the
beginning until width = 4P (P stands for the number of threads), where we obtain enough disjoint
access parallelism. After this parameter saturates, one can continue to relax in the depth dimension
to increase the performance via better locality and fewer extra steps. In terms of accuracy, we
observe that the expected error-distance increases almost linearly with the width parameter, whereas
it increases almost logarithmically with the increase of the depth parameter (4P for k > 200). One
can almost recklessly relax on the depth dimension to exploit the locality.
Now, we aim to minimize the window maintenance cost. In a sequential process, this cost is
not significant. In a concurrent execution, however, it can be very costly, as each update imposes
a cost on all threads. For small values of the relaxation, the window gets updated more frequently.
We therefore optimize the process by tuning the shift parameter. Without having a general solution,
we illustrate the optimal value of the shift parameter for p = 1/2 (p denotes the probability of
a Push), for the sequential process that we consider. We will show that shift = depth/2 (assuming
depth is even) is optimal for E(Glo). Intuitively, these two metrics (Hop and Glo) seem to be
correlated, because the window shifts only after the maximum number of hops, and it is more
probable to observe a window shift in an interval with operations that complete after many extra
steps. We believe that the minimization of E(Glo) would also reduce E(Hop), for all values of p.
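Lemma 7 For the Markov chain that is initialized with p = 1/2 and shift, where
$l = shift \times width - 1$, the stationary distribution is given by the vector
$\pi^l = (\pi^l_0, \pi^l_1, \dots, \pi^l_K)$, assuming $K - l \ge l$ (for $l > K - l$, one can obtain
the vector from the symmetry $\pi^l = \pi^{K-l}$):
(i) $\pi^l_i = \frac{i+1}{(l+1)(K+1-l)}$, if $i < l$;
(ii) $\pi^l_i = \frac{l+1}{(l+1)(K+1-l)}$, if $l \le i \le K - l$;
(iii) $\pi^l_i = \frac{K-i+1}{(l+1)(K+1-l)}$, if $i > K - l$.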
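Proof: In Section 5, we have stated that the stationary distribution exists since the chain is
aperiodic and irreducible for all p and shift. Let $(M_{i,j})_{(i,j) \in \{0, \dots, K\}^2}$ denote
the transition matrix for p = 1/2 and shift = depth/2, which is given in Section 5. The stationary
distribution vector $\pi^l$ fulfills $\pi^l M = \pi^l$, which provides the following system of
linear equations:
(i) $2\pi^l_0 = \pi^l_1$;
(ii) $2\pi^l_K = \pi^l_{K-1}$;
(iii) $2\pi^l_i = \pi^l_{i-1} + \pi^l_{i+1}$, if $0 < i < K$ and $i \notin \{l, K - l\}$;
(iv) $2\pi^l_l = \pi^l_{l-1} + \pi^l_{l+1} + \pi^l_0$;
(v) $2\pi^l_{K-l} = \pi^l_{K-l-1} + \pi^l_{K-l+1} + \pi^l_K$.
In case $l = K - l$, (iv) and (v) are replaced with
$2\pi^l_l = \pi^l_{l-1} + \pi^l_{l+1} + \pi^l_0 + \pi^l_K$.
Based on a symmetry argument, one can observe that, for all l, $\pi^l_i = \pi^l_{K-i}$. The system
can be solved in linear time (O(K)) by assigning any positive value (for an irreducible chain,
$\pi^l_i > 0$) to $\pi^l_0$.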
Figure 1: 2D-stack throughput and observed quality for different depth and width configurations
The stationary distribution is unique; thus, for any $\pi^l_0$, $\pi^l$ spans the solution space.
We know that $\sum_{i=0}^{K} \pi^l_i = 1$; starting from $\pi^l_0 = 1$, we obtain each item and
normalize it by the sum.
Theorem 8 Given p = 1/2, shift = depth/2 minimizes E (Glo), where K = width × depth and
l = width× shift.
Proof: The global barrier is updated either with the transition $S_0 \to S_{shift \times width}$
generated by a Pop operation or $S_K \to S_{K - shift \times width}$ generated by a Push operation.
Therefore, our objective is to minimize $E(Glo) = p\,\pi^l_K + (1 - p)\,\pi^l_0$. With p = 1/2 and
the symmetry of the Markov chain states, $\pi^l_i = \pi^l_{K-i}$, the objective reduces to
minimizing $\pi^l_0 = \frac{1}{(l+1)(K+1-l)} = \frac{1}{-l^2 + Kl + K + 1}$. The quadratic function
$f(l) = -l^2 + Kl + K + 1$ has its maximum value at $l = K/2$, since $f'(l) = K - 2l = 0$. Having
$l = K/2 = shift \times width$, we obtain the optimal value:
$shift = K/(2\,width) = depth/2$.
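The optimum in Theorem 8 can be checked by direct enumeration over the admissible shift values.
The sketch below uses the theorem's convention $l = width \times shift$ and the closed form
$\pi^l_0 = 1/((l+1)(K+1-l))$; with p = 1/2 and the symmetry $\pi^l_0 = \pi^l_K$, E(Glo) equals
$\pi^l_0$. The function names are hypothetical.

```python
from fractions import Fraction


def pi_zero(width, depth, shift):
    """Stationary probability of state S_0, using l = width * shift as in Theorem 8."""
    K, l = width * depth, width * shift
    return Fraction(1, (l + 1) * (K + 1 - l))


def best_shift(width, depth):
    """Shift in 1..depth minimizing E(Glo) = pi_0 for p = 1/2."""
    return min(range(1, depth + 1), key=lambda s: pi_zero(width, depth, s))


if __name__ == "__main__":
    # Hypothetical configuration: 4 sub-stacks, depth 8.
    print(best_shift(width=4, depth=8))
```

For an even depth the denominator is a downward parabola in l that is symmetric around K/2, so the
enumeration and the derivative argument agree.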
7 Experimental Evaluation
We evaluate the performance of our implementations together with other existing stack designs,
including the k-segment relaxed stack [13], Elimination-Stack (elimination) [12] and Treiber-Stack
(treiber) [8]. All experiments run on an Intel Xeon CPU E5-2687W v2 machine with two 8-core
3.40GHz Intel Xeon processors (16 cores, 2 threads per core). We pin one thread per core, filling
one socket at a time, up to 16 threads, before we switch to hyper-threading. Two NUMA settings are
tested: intra-socket (1 to 8 threads) and inter-socket (9 to 16 threads). Threads select operations
uniformly at random (i.e., with probability 1/2) from Pop and Push operations. Memory is managed
using SSMEM [9]. To simulate high contention, we put no computational load between operations.
For each experiment, the stack is initialized with $2^{15}$ items and run for five seconds; results
are averaged over five repeats. Throughput is measured in terms of operations per second, whereas
accuracy is measured in terms of error distance from the LIFO semantics.
To measure the accuracy, we adopt and modify a method similar to the one in [4, 15]. A sequential
linked list is run alongside the stack; for each Push or Pop, a simultaneous insert or delete is
performed on the list, respectively. Items on the stack are duplicated on the list and can be
identified by their unique labels. Insert operations happen at the head of the list, similar to the
Push, whereas the delete operation searches for the given item, deletes it and returns its distance
from the head (error distance). We then calculate the expected error distance for a given
experiment run for 5 seconds with 5 repeats. Scalability is tested on relaxation, by changing
(continuously weakening) the relaxation, and on concurrency, by changing (increasing) the number of
threads for different NUMA settings. Experiment results are then plotted using logarithmic scales,
with throughput (solid lines) and error distance (dotted lines) sharing the x-axis.
Figure 2: Randomized algorithms' throughput and observed quality with increasing sub-stacks
Figure 3: Throughput and observed quality as the k bound for relaxation increases.
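The error-distance measurement described in this section can be sketched as a shadow structure
that mirrors every Push and Pop. This is a minimal, single-threaded illustration under stated
assumptions: a Python list stands in for the sequential linked list, labels are assumed unique,
and the class and method names are hypothetical.

```python
class ErrorDistanceTracker:
    """Shadow list run alongside a relaxed stack to record how far each Pop is from LIFO."""

    def __init__(self):
        self._shadow = []      # index 0 is the head (most recently pushed label)
        self.distances = []    # one error distance per recorded Pop

    def on_push(self, label):
        """Mirror a Push: insert the item's unique label at the head."""
        self._shadow.insert(0, label)

    def on_pop(self, label):
        """Mirror a Pop: find the label, delete it, and return its distance from the head."""
        distance = self._shadow.index(label)
        del self._shadow[distance]
        self.distances.append(distance)
        return distance

    def mean_error_distance(self):
        """Average error distance over all recorded Pops (0.0 if none were recorded)."""
        if not self.distances:
            return 0.0
        return sum(self.distances) / len(self.distances)


if __name__ == "__main__":
    tracker = ErrorDistanceTracker()
    for label in ("a", "b", "c"):
        tracker.on_push(label)
    # A strict LIFO stack would return "c" next; returning "a" instead is two positions off.
    print(tracker.on_pop("a"))
```

A distance of 0 means the popped item was the true top of the stack, so a strict LIFO stack would
record only zeros.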
Based on the analysis presented in Section 6, which was also checked against our experimental
observations, we select width = 4P (P stands for the number of threads) as the transition point
from width to depth relaxation, and shift = depth/2 as the optimal performance configuration for
2D-stack. In Figure 3, we evaluate the performance of all algorithms that are linearizable with
respect to the k-Out-of-Order stack (k-robin, 2D-stack and k-segment) at different relaxation
levels. The randomized algorithms (Random and Random-c2) do not exhibit k-Out-of-Order bounds;
they are therefore evaluated with respect to the number of sub-stacks in Figure 2. We observe that
2D-stack consistently outperforms the other algorithms, followed by k-robin, for both settings
where the number of threads changes from 8 to 16 (NUMA setting). Under a low degree of relaxation,
2D-stack avoids contention by hopping to another sub-stack on a failed CAE. This greatly improves
performance compared to k-robin, which keeps retrying on the same sub-stack. As the relaxation
increases, 2D-stack combines contention avoidance with locality exploitation, a capability
exclusive to the 2D-stack design, as explained in Section 6. While for the other algorithms the
accuracy reduces almost linearly with the increase in relaxation, 2D-stack maintains good accuracy
with width > 4P (k > 200 for P = 8 and k > 600 for P = 16). At this point, the algorithm switches
from width to depth relaxation by increasing the depth. This change has a smaller negative impact
on the accuracy, compared to the other algorithms. 2D-stack continuously trades off accuracy for
throughput by switching between relaxation dimensions for different relaxation levels. k-segment
is mostly affected by the high cost of maintaining segments, coupled with an increased number of
hops as relaxation increases.
Figure 4: Throughput as concurrency (number of threads) increases.
We now configure the algorithms to obtain the maximum throughput for both intra- and inter-socket
settings (Figure 4). Based on the results that we observed and discussed before (Figure 2), the
width for Random and Random-c2 is set to $10^3$ sub-stacks. For k-robin, 2D-stack and k-segment,
this translates to $k = 10^4$ (Figure 3). Two "non-relaxed" algorithms, elimination and treiber,
are also included in the experiment, to compare the power of relaxation with other efficiency
improvement techniques that preserve strict semantics. We generally observe that 2D-stack is able
to maintain the increase in throughput while increasing the number of threads, even for the NUMA
settings. Under the inter-socket setting, Random maintains an almost constant throughput as we
increase concurrency, whereas the throughput of the other algorithms drops. As the number of
threads increases, Random, Random-c2 and k-segment maintain almost constant accuracy due to the
fixed number of sub-stacks. k-robin and 2D-stack vary the number of sub-stacks as the number of
threads changes. k-robin reduces the number of sub-stacks with the increase in the number of
threads to keep the k-bound; this improves accuracy but hurts throughput due to the increased
contention. As observed, 2D-stack maintains high throughput also when the number of threads
increases for different NUMA settings. Overall, 2D-stack shows full control in leveraging
semantics relaxation to reach very high throughput in a continuous way, a property that is missing
in other solutions.
8 Conclusion
The aim of this work is to design an efficient lock-free stack algorithm that can continuously
relax semantics to improve throughput by exploiting disjoint access parallelism and locality.
We have achieved this through our two-dimensional relaxed stack design, which exploits disjoint
access parallelism in one dimension and locality in the other. The 2D-stack uses an efficient
window-based synchronization technique that manages to keep the relaxation bound without giving up
the significant performance achieved through locality. 2D-stack significantly outperformed all the
other stack implementations due to its capability to switch relaxation dimensions, leading to a
monotonic trade of accuracy for better performance. In addition to 2D-stack, we have implemented
and tested a set of other possible relaxed stack designs. Together with the step complexity
analysis, we also provided tight accuracy bounds for two algorithms presented in this report,
including 2D-stack.
References
[1] Yehuda Afek, Guy Korland, Maria Natanzon, and Nir Shavit. Scalable producer-consumer
pools based on elimination-diffraction trees. Euro-Par 2010-Parallel Processing, pages 151–162,
2010.
[2] Yehuda Afek, Guy Korland, and Eitan Yanovsky. Quasi-linearizability: Relaxed consistency
for improved concurrency. In International Conference on Principles of Distributed Systems,
pages 395–410. Springer, 2010.
[3] Dan Alistarh, Justin Kopinsky, Jerry Li, and Giorgi Nadiradze. The power of choice in priority
scheduling. In Proceedings of the ACM Symposium on Principles of Distributed Computing,
PODC ’17, pages 283–292, New York, NY, USA, 2017. ACM.
[4] Dan Alistarh, Justin Kopinsky, Jerry Li, and Nir Shavit. The spraylist: A scalable relaxed
priority queue. ACM SIGPLAN Notices, 50(8):11–20, 2015.
[5] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and
Martin Vechev. Laws of order: Expensive synchronization in concurrent algorithms cannot be
eliminated. SIGPLAN Not., 46(1):487–498, January 2011.
[6] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM J.
Comput., 29(1):180–200, September 1999.
[7] Gal Bar-Nissan, Danny Hendler, and Adi Suissa. A dynamic elimination-combining stack
algorithm. CoRR, abs/1106.6304, 2011.
[8] Thomas J. Watson IBM Research Center and R.K. Treiber. Systems Programming: Coping
with Parallelism. Research Report RJ. International Business Machines Incorporated, Thomas
J. Watson Research Center, 1986.
[9] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Asynchronized concurrency: The
secret to scaling concurrent search data structures. SIGARCH Comput. Archit. News,
43(1):631–644, March 2015.
[10] Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. Salsa: scalable and low
synchronization numa-aware algorithm for producer-consumer pools. In Proceedings of the
twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages
151–160. ACM, 2012.
[11] Andreas Haas, Michael Lippautz, Thomas A. Henzinger, Hannes Payer, Ana Sokolova,
Christoph M. Kirsch, and Ali Sezgin. Distributed queues in shared memory: Multicore
performance and scalability through quantitative relaxation. In Proceedings of the ACM
International Conference on Computing Frontiers, CF ’13, pages 17:1–17:9, New York, NY, USA,
2013. ACM.
[12] Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. Journal
of Parallel and Distributed Computing, 70(1):1–12, 2010.
[13] Thomas A. Henzinger, Christoph M. Kirsch, Hannes Payer, Ali Sezgin, and Ana Sokolova.
Quantitative relaxation of concurrent data structures. In ACM SIGPLAN Notices, volume 48, pages
317–328. ACM, 2013.
[14] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE
Transactions on Parallel and Distributed Systems, 12(10):1094–1104, 2001.
[15] Hamza Rihani, Peter Sanders, and Roman Dementiev. Brief announcement: Multiqueues:
Simple relaxed concurrent priority queues. In Proceedings of the 27th ACM symposium on
Parallelism in Algorithms and Architectures, pages 80–82. ACM, 2015.
[16] Nir Shavit and Gadi Taubenfeld. The computability of relaxed data structures: Queues and
stacks as examples. Distrib. Comput., 29(5):395–407, October 2016.
[17] Nir Shavit and Dan Touitou. Elimination trees and the construction of pools and stacks:
preliminary version. In Proceedings of the seventh annual ACM symposium on Parallel algorithms
and architectures, pages 54–63. ACM, 1995.
[18] Nir Shavit and Asaph Zemach. Combining funnels: a dynamic approach to software combining.
Journal of Parallel and Distributed Computing, 60(11):1355–1387, 2000.
[19] Håkan Sundell, Anders Gidenstam, Marina Papatriantafilou, and Philippas Tsigas. A lock-free
algorithm for concurrent bags. In Proceedings of the Twenty-third Annual ACM Symposium on
Parallelism in Algorithms and Architectures, SPAA ’11, pages 335–344, New York, NY, USA,
2011. ACM.
[20] Edward Talmage and Jennifer L. Welch. Relaxed Data Types as Consistency Conditions, pages
142–156. Springer International Publishing, Cham, 2017.
[21] Martin Wimmer, Jakob Gruber, Jesper Larsson Träff, and Philippas Tsigas. The lock-free k-lsm
relaxed priority queue. In ACM SIGPLAN Notices, volume 50, pages 277–278. ACM, 2015.
