A new and five older Concurrent Memory Reclamation Schemes in Comparison
  (Stamp-it) by Pöter, Manuel & Träff, Jesper Larsson
Manuel Pöter
TU Wien, Faculty of Informatics
Vienna, Austria
manuel@manuel-poeter.at
Jesper Larsson Träff
TU Wien, Faculty of Informatics
Vienna, Austria
traff@par.tuwien.ac.at
December 19, 2017
arXiv:1712.06134v1 [cs.DC] 17 Dec 2017
Abstract
Memory management is a critical component in almost all shared-memory, concurrent data
structures and algorithms, consisting in the efficient allocation and the subsequent reclamation
of shared memory resources. This paper contributes a new, lock-free, amortized constant-time
memory reclamation scheme called Stamp-it, and compares it to five well-known, selectively
efficient schemes from the literature, namely Lock-free Reference Counting, Hazard Pointers,
Quiescent State-based Reclamation, Epoch-based Reclamation, and New Epoch-based Recla-
mation. An extensive, experimental evaluation with both new and commonly used benchmarks
is provided, on four different shared-memory systems with hardware supported thread counts
ranging from 48 to 512, showing Stamp-it to be competitive with and in many cases and aspects
outperforming other schemes.
1 Introduction
Efficient, dynamic memory management is at the heart of many sequential and parallel algorithms,
and consists in the allocation of pieces of memory and the subsequent, safe reclamation of these
pieces when they are no longer in use. In parallel and concurrent, lock- and wait-free algorithms,
the reclamation step is highly non-trivial since more than one thread may be referencing and using
an allocated piece of memory unbeknownst to other threads: It cannot be given back to the system
or thread-local heap before it has been ascertained that no threads will possibly access any data in
this memory anymore.
Naturally, there has been a substantial amount of work on memory reclamation for concurrent
algorithms, see, e.g., [2–4, 6–9, 11, 15, 18, 19, 22, 28, 30]. All of these schemes have their merits and
(performance) issues. One drawback shared by them all, except for reference counting schemes, is
that they need to scan references from all threads in order to reclaim possibly no longer referenced
memory pieces. A main motivation of this work is to overcome this bound.
The contribution of this paper is a new lock-free reclamation scheme, called Stamp-it, which
is compared qualitatively and experimentally to five well-known and, depending on circumstances,
well performing current schemes. Reclamation in Stamp-it is done in amortized constant time
per reclaimed memory block; no references are scanned unless they can be reclaimed. All tested
schemes have been (re)implemented in C++; the full source code is available at
http://github.com/mpoeter/emr. The experimental evaluation is done on four architecturally
different systems
with large numbers of hardware supported threads, ranging from 48 up to 512. We use standard
benchmarks, as well as a new benchmark designed to study memory consumption by reclaimable
but unreclaimed memory. On these benchmarks and machines Stamp-it compares favorably and in
many cases and aspects significantly outperforms the competing schemes.
A contiguous piece of memory allocated from the system heap for use in the concurrent algorithm
and possibly shared between the threads is called a node. Efficient allocation and deallocation is
a complex topic on its own and a number of scalable memory managers have been published
[5, 12, 14, 23], but this topic is outside the scope of this work. We mainly use the allocator from
the C++ standard library, except on Solaris, where we use jemalloc as described later.
Nodes may store additional meta-information that is normally not seen by the application;
additional meta-information needed by the different reclamation schemes will be discussed. We use
p to denote the number of threads.
A general purpose reclamation scheme allows the eventually reclaimed memory of nodes to be
freely reused at a later time, regardless of how and in which data structure the allocated nodes
were used. Not all reclamation schemes have this property, e.g., [30,32] do not allow general reuse
of reclaimed nodes, [6, 15] have to be tailored to each data structure and [8, 9] require the data
structure to be in a special, normalized form. A general purpose scheme should be non-intrusive,
requiring few or no changes in the code. One way of achieving this is to rely on a standard interface
such as the one proposed for C++ [29]. A reclamation scheme should be fast, both in use and maintenance
of references to shared nodes, as well as in the actual reclamation. It should require little memory
overhead, avoid typical performance issues like false sharing and should not prevent data structures
from using typical patterns found in lock-free programming like borrowing some bits from a pointer.
Reclaimability of nodes should be detected fast to reduce memory pressure. Robustness against
crashes, and bounds on the amount of memory blocked by crashed threads are desirable. Lock-
freedom should allow for good scalability; wait-freedom would be desirable, but not many schemes
actually provide this.
All lock- and wait-free algorithms rely on hardware supported atomic operations. We consider
only solutions that use hardware atomics available in standard processors like fetch-and-add (FAA)
and single-word compare-and-swap (CAS). Solutions requiring non-standard double-word compare-
and-swap (as in, e.g., [11]) will either be non-portable or require expensive emulations. We also
ruled out solutions that have to be tailored to specific data structures (like [6, 15]) or that require
hardware or operating system specific features like transactional memory (e.g., [2]) or POSIX
signals (e.g., [3, 7]). The aim was to create a portable, platform-independent implementation that
fully conforms to the C++ standard. Our implementation is mature beyond a simple proof of concept;
it is applicable to real-life applications and works with arbitrary numbers of threads that can be
started and stopped arbitrarily.
Based on the above discussion, we have implemented Lock-free Reference Counting (LFRC) [32],
Hazard Pointers (HP) [22], Quiescent State-based Reclamation (QSR), Epoch-based Reclamation
(ER) [13], and New Epoch-based Reclamation (NER) [18]. Hart et al. [18] used the same selection
of schemes in their study, and we wanted to repeat their experiments with our own implementations
on different platforms and at a larger scale.
2 Memory Reclamation Schemes
We first describe in more detail five memory reclamation schemes, selected according to the desired
criteria described above. This provides the basis for introducing Stamp-it and qualitatively comparing
it to the other schemes. Although fitting our criteria, we have at this point in time not considered
the scheme DEBRA [7], but may add results from this at a later time. Another very recent scheme
called “Interval based reclamation” (paper to appear) is likewise not discussed.
2.1 Lock-free Reference Counting
Reference counting is a well known concept that has been used for decades. The first reclamation
scheme for lock-free data structures based on reference counting was presented by John D. Val-
ois [32]. The original proposal contained race conditions that were discovered and corrected by
Maged M. Michael and Michael L. Scott [24].
In reference counting, each node is equipped with an integer field that tracks the number of
references to the node. Each thread is responsible for updating this reference counter accordingly—
incrementing it for each new reference, decrementing it for every dropped reference. The increment
is implemented using a simple FAA operation. The decrement, however, is more complicated and
has to be implemented using a CAS operation; the reason for this is explained below. When a
reference counter drops to zero there are no more references to this node and it can therefore be
reclaimed. A general problem with reference counting schemes is that they can only reclaim nodes
in acyclic structures; circular structures are vulnerable to memory leakage.
Although lock-free reference counting (LFRC) avoids locks, it cannot guarantee an upper bound
on the amount of memory consumed by removed nodes, since every thread can hold an arbitrary
number of references to nodes. It has been shown by Michael [22] and Hart et al. [18] that reference
counting can incur large performance overheads that often make lock-free data structures perform
worse than their lock-based counterparts.
LFRC is not a general reclamation scheme where reclaimed memory can be reused arbitrarily.
However, LFRC can be used in situations where reclaimed nodes are reused in the same data
structure. As the following example shows, the scheme expects a node’s reference counter to be
available indefinitely, such that it is possible to update the reference count on a potentially reclaimed
node without corrupting the data structure. This would not be the case if the memory was reused
in different data structures.
Assume a thread wants to move from one node to the next in a linked list. With LFRC, it
has to perform the following steps:
1. Read the reference to the next node.
2. Increase the reference counter of the next node.
3. Reread the reference to the next node and check whether it has changed in the meantime.
(a) If it has changed, decrease the reference counter of the next node, drop the reference
and start again at Step 1.
(b) Otherwise the thread has a safe reference to the next node.
In this sequence there is a race condition between Step 1 (reading the reference) and Step 2
(incrementing the reference counter). It could happen that between these two steps the node’s
reference counter drops to zero (due to another thread releasing its reference) such that the node
becomes reclaimable. In Step 2, the thread would then increment the reference counter of a po-
tentially reclaimed node. To overcome the race, Step 3 rereads the reference to the next node to
ensure that it has not changed in the meantime. In case it has changed the thread has to restart
the whole procedure.
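The three steps above can be sketched in C++ as follows. This is a minimal illustration, not the implementation evaluated in this paper: the names Node, safe_read and release are hypothetical, the decrement is shown as a plain FAA for brevity (the full scheme needs the CAS-based variant), and nodes are assumed never to be returned to the system, so that the counter of a reclaimed node stays accessible.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical node layout; the counter must stay valid even after reclamation.
struct Node {
    std::atomic<std::uint32_t> refs{0};
    std::atomic<Node*> next{nullptr};
};

// Simplified decrement (the full scheme uses a CAS with a claim bit instead).
inline void release(Node* n) { n->refs.fetch_sub(1); }

// Acquire a counted reference to the node currently stored in `link`.
inline Node* safe_read(std::atomic<Node*>& link) {
    for (;;) {
        Node* n = link.load();            // Step 1: read the reference
        if (n == nullptr) return nullptr;
        n->refs.fetch_add(1);             // Step 2: increment the counter (FAA)
        if (link.load() == n) return n;   // Step 3: reread and validate
        release(n);                       // Step 3(a): undo the increment, retry
    }
}
```

The increment in Step 2 may hit a node that has already been reclaimed; this is harmless only because the counter field is still valid memory, which is exactly why the scheme restricts reuse to the same data structure.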
In order to reuse reclaimed nodes in the same data structure, a special, global free-list is
maintained. When a thread wants to decrement the reference count it checks whether it is about to
drop the last reference and if that is the case tries to set a “claim bit” in a single atomic compare-
and-swap operation. The thread that successfully sets the claim bit can safely push the node to
the free-list.
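The following sketch shows why the decrement needs a CAS: dropping the last reference and setting the claim bit must happen in one atomic step, so that exactly one thread pushes the node to the free-list. The encoding is an assumption made for this illustration (claim bit in the lowest bit, count in the remaining bits), and decrement_and_claim is a hypothetical name.

```cpp
#include <atomic>
#include <cstdint>

// Assumed encoding: bit 0 is the claim bit, the count occupies the upper bits,
// so one reference corresponds to adding 2.
constexpr std::uint32_t kClaimBit = 1;
constexpr std::uint32_t kOneRef = 2;

// Drops one reference. Returns true only for the single thread that dropped
// the last reference and thereby set the claim bit.
inline bool decrement_and_claim(std::atomic<std::uint32_t>& refct_claim) {
    std::uint32_t old_val = refct_claim.load();
    std::uint32_t new_val;
    do {
        new_val = old_val - kOneRef;
        if (new_val == 0) new_val = kClaimBit;   // last reference: claim the node
    } while (!refct_claim.compare_exchange_weak(old_val, new_val));
    // Claimed by us only if the count was one and the claim bit was clear.
    return old_val == kOneRef;
}
```

A thread that increments and then decrements the counter of an already claimed node sees the claim bit set and therefore does not claim the node a second time.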
There are several other proposals for systems based on this reference counting scheme. Detlefs
et al. [11] allow changing the node’s type upon reclamation but require a double-compare-and-swap
(DCAS) operation which is usually not supported by current CPUs. Another scheme proposed by
Sundell [30] is wait-free, but the number of threads has to be known in advance.
2.2 Hazard pointers
Hazard Pointer based reclamation (HP) was introduced by M. Michael [22]. This scheme is some-
times also referred to as safe memory reclamation (SMR) [13,21,31]. HP is based on the observation
that, in the vast majority of lock-free data structures, threads hold only a small number of refer-
ences that may later be used without further validation. The main idea is to associate a number
of single-writer, multi-reader shared pointers, called hazard pointers, with each thread to operate
on the associated nodes. Each thread has k hazard pointers (depending on the actual algorithm
and data structure, but k must be known a priori). With p threads, H = pk hazard pointers are
needed in total.
When a thread wants to access a shared node, it stores the node’s reference in one of its unused
hazard pointers. This is the way to signal to the other threads that this thread is using this
particular node and that it is therefore not safe to reclaim it. When the thread no longer needs the
node it simply resets the corresponding hazard pointer to null.
Nodes that have been removed from the data structure and need to be reclaimed (called retired
nodes in [22]) are maintained in thread-local lists. Whenever the size of a thread’s list reaches some
chosen threshold R, the thread tries to reclaim the nodes from the list. Increasing R amortizes
reclamation overhead across more nodes, but increases memory usage; if R is larger than H by some
amount proportional to H the amortized per-node processing time is constant, but this can cause
performance issues with large numbers of threads as will be shown in Section 4.4. To determine
whether it is safe to reclaim a certain node, the thread scans the hazard pointers of all other threads
to check if one of them is currently using it. Since each thread has k hazard pointers and can hold
R removed elements in its private list, a crashed thread can prevent only k + R removed elements
from being reclaimed. The HP scheme thus bounds the amount of memory which can be occupied
by removed nodes, even in the presence of thread failures.
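The protect, retire and scan steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the number of threads P is fixed, thread ids are passed explicitly, and HazardDomain with its members is a hypothetical structure rather than the interface of [22] or of our implementation.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t P = 4;          // number of threads (assumed fixed)
constexpr std::size_t K = 2;          // hazard pointers per thread
constexpr std::size_t R = 2 * P * K;  // scan threshold, proportional to H = P*K

struct Node { int value = 0; };

struct HazardDomain {
    std::array<std::atomic<Node*>, P * K> hp{};  // all H hazard pointers
    std::array<std::vector<Node*>, P> retired;   // thread-local retire lists
    std::size_t reclaimed = 0;                   // for illustration only

    // Publish the reference, then validate that `src` still points to it.
    Node* protect(std::size_t tid, std::size_t slot, std::atomic<Node*>& src) {
        Node* n = src.load();
        for (;;) {
            hp[tid * K + slot].store(n);
            Node* again = src.load();
            if (again == n) return n;
            n = again;
        }
    }
    void clear(std::size_t tid, std::size_t slot) { hp[tid * K + slot].store(nullptr); }

    // Reclaim every retired node of `tid` that no hazard pointer refers to.
    void scan(std::size_t tid) {
        std::vector<Node*> in_use;
        for (auto& h : hp)
            if (Node* n = h.load()) in_use.push_back(n);
        std::sort(in_use.begin(), in_use.end(), std::less<Node*>());
        std::vector<Node*> keep;
        for (Node* n : retired[tid]) {
            if (std::binary_search(in_use.begin(), in_use.end(), n, std::less<Node*>()))
                keep.push_back(n);
            else { delete n; ++reclaimed; }
        }
        retired[tid].swap(keep);
    }

    void retire(std::size_t tid, Node* n) {
        retired[tid].push_back(n);
        if (retired[tid].size() >= R) scan(tid);
    }
};
```

Every call to scan reads all H hazard pointers, which is the per-scan cost that grows with the number of threads.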
HP can be extended to support an arbitrary number of hazard pointers per thread as explained
by Michael [22], but unfortunately, this change destroys the two important properties that set HP
apart from other reclamation schemes: Constant processing time per element as well as the upper
bound on unreclaimable nodes. Aghazadeh et al. [1] gave an improved version of HP that reduces
the number of comparisons per scan to one, at the cost of increasing the amount of time between
node removal and node reclamation.
A proposal to add hazard pointers to the C++ standard library has been brought up by Michael
and Wong [26], and is currently receiving attention.
2.3 Quiescent state based reclamation
Quiescent State based Reclamation (QSR) is typically used to implement read-copy-update (RCU)
schemes [10, 20]. It relies on the concept of a grace period which is a time interval [a, b] such that,
after time b, all nodes removed before time a can safely be reclaimed. QSR uses quiescent states to
detect grace periods. A quiescent state for some thread T is a state in which T holds no references
to shared nodes. In particular, T holds no references to any shared nodes which have been removed
from a lock-free data structure. A time interval in which every thread of the system has passed
through at least one quiescent state is therefore a grace period.
A typical way to implement QSR is by using a non-blocking fuzzy barrier [16]. The fuzzy-barrier
is used to protect the code that performs the reclamation. The threads try to enter the barrier and
reclaim retired nodes when they pass through a quiescent state. In order to determine whether all
threads have reached the barrier (i.e., whether they went through at least one quiescent state) all
threads have to be checked. This incurs a performance overhead linear in the number of threads.
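The linear check can be illustrated with per-thread counters instead of a fuzzy barrier. This is a simplified sketch of grace-period detection under the assumption of a fixed number of threads P; QsrDomain and its members are hypothetical names and do not reflect the barrier-based implementation.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t P = 3;  // number of threads (assumed fixed)

struct QsrDomain {
    // One counter per thread, incremented at each quiescent state.
    std::array<std::atomic<std::uint64_t>, P> quiescent_count{};

    using Snapshot = std::array<std::uint64_t, P>;

    void quiescent_state(std::size_t tid) { quiescent_count[tid].fetch_add(1); }

    // Taken at time a, i.e., after the nodes in question have been removed.
    Snapshot snapshot() const {
        Snapshot s{};
        for (std::size_t i = 0; i < P; ++i) s[i] = quiescent_count[i].load();
        return s;
    }

    // True once every thread has passed through a quiescent state since the
    // snapshot, i.e., a grace period has elapsed. The cost is linear in P.
    bool grace_period_elapsed(const Snapshot& s) const {
        for (std::size_t i = 0; i < P; ++i)
            if (quiescent_count[i].load() == s[i]) return false;
        return true;
    }
};
```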
2.4 Epoch based reclamation
Epoch-based Reclamation (ER), introduced by Fraser [13], also relies on grace periods. Nodes that
have been removed from data structures are kept in thread-local limbo lists that hold the references
to the nodes until it is safe to reclaim them. The scheme uses three epochs and each of the epochs
has an associated limbo list.
In ER, the programmer has to identify critical regions in which threads are allowed to access
shared objects. These regions have to be entered and left explicitly. A global epoch count is used
to determine when no stale references exist to any object in a limbo list.
Every thread has a flag that indicates whether this thread is currently in a critical region as
well as a local epoch count that identifies the epoch in which it currently executes (in case it is
inside a critical region). The thread’s local epoch count may lag at most one epoch behind the
global epoch. Each time a thread enters a critical region, it sets the flag and observes the current
epoch, i.e., it updates its local epoch to match the global epoch. A thread that removes a node
from a data structure places this node on the current limbo list that is associated with the current
epoch.
After some predetermined number of critical region entries, a thread will attempt to update
the global epoch. This succeeds only if all threads in a critical region have already observed the
current epoch which can again be detected with a fuzzy barrier. In that case, the limbo list that
was populated two epochs ago can safely be reclaimed and the list itself recycled and reused for
the next epoch. Thus only three epochs (and limbo lists) are required in total.
To determine whether all threads have observed the new global epoch, all thread-local epochs
have to be checked. This incurs a performance overhead linear in the number of threads.
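A minimal sketch of the region entry and the attempt to advance the global epoch is given below. It assumes a fixed number of threads and omits the limbo lists; EpochDomain and its members are hypothetical names. Keeping the flag and the local epoch in two separate variables is a simplification that only makes the check more conservative.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t P = 3;  // number of threads (assumed fixed)

struct EpochDomain {
    struct ThreadState {
        std::atomic<bool> in_critical{false};
        std::atomic<unsigned> local_epoch{0};
    };
    std::atomic<unsigned> global_epoch{0};
    std::array<ThreadState, P> threads;

    void enter(std::size_t tid) {
        threads[tid].in_critical.store(true);                 // sequentially consistent
        threads[tid].local_epoch.store(global_epoch.load());  // observe the epoch
    }
    void leave(std::size_t tid) { threads[tid].in_critical.store(false); }

    // Succeeds only if every thread inside a critical region has observed
    // the current epoch; all P thread states are scanned.
    bool try_advance() {
        unsigned e = global_epoch.load();
        for (auto& t : threads)
            if (t.in_critical.load() && t.local_epoch.load() != e) return false;
        return global_epoch.compare_exchange_strong(e, e + 1);
    }
};
```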
2.5 New epoch based reclamation
New Epoch-based Reclamation (NER) is an extension to ER proposed by Hart et al. [18]. The
original description of ER defines a critical region around every operation. However, entering a
critical region requires a sequentially consistent memory fence and such operations can be expensive.
This is necessary to guarantee that another thread that tries to update the global epoch actually
sees the new value and therefore recognizes that this thread is inside a critical region. Without this
guarantee, a race condition can occur, where the global epoch gets updated which in turn allows
a node to be freed even though it is still in use by some thread, just because the update of this
thread’s critical region flag was not noticed by the thread that updated the global epoch. In ER
every single operation on some lock-free data structure is encapsulated in its own critical region,
thus every such operation requires a memory fence.
Time  Global stamp (initial=0)  stamp1  stamp2  stamp3  Event
t1    update, new value 1       0       -       -
t2    update, new value 2       0       1       -
t3    read, value 2                                     remove n1
t4    update, new value 3       0       1       2
t5    read, value 3                                     remove n2
t6                              0       -       2
t7                              -       -       2
t8                              -       -       -
Figure 1: Example showing global and local stamps of an execution with Stamp-it. Thick lines mark critical
regions. (Shown here as a table for threads T1, T2 and T3; "-" denotes a thread that is not inside a
critical region.)
Hart et al. [18] showed this overhead for every single operation to be very significant. As
a remedy, NER allows critical regions to cover several operations. For example, when a group of
operations on some data structure has to be performed together, the critical region is entered before
the first operation and left after the last one, effectively expanding the region over all operations
and thus distributing the overhead for the region entry over the whole group of operations. The
drawback is that due to the larger regions the global epoch might be updated less frequently which
could delay reclamation and thus increase memory pressure.
2.6 Stamp-it
We now introduce our new scheme, Stamp-it. It is conceptually similar to NER and therefore
provides many of the same properties. As in ER/NER, the programmer has to define critical regions
that are entered and left explicitly. A thread is only allowed to access shared objects inside such
regions.
When a thread enters a critical region it increments a global stamp using an atomic fetch-and-
add (FAA) and stores the returned stamp in a thread-local data structure visible to the other
threads. By setting the stamp in the data structure, the thread also signals to other threads that
it is now inside a critical region. When a thread retires a node for reclamation it takes the current
value of the global stamp, stores it in a special field of the node, and appends the node to the end
of a thread-local retire list. The node can be reclaimed as soon as all the threads that were inside
a critical region at the time the node was added to the retire list have left their respective critical
region.
When a thread leaves a critical region, it resets its stamp and tries to reclaim retired nodes
from the local retire list. For that, it must determine the lowest stamp value of threads that are
inside a critical region, i.e., the stamp value of the thread that entered its critical region
earliest. Any node in the retire list that has a stamp value less than or equal to this lowest stamp
can safely be reclaimed. Since retired nodes are appended to the end of the retire list they are
strictly ordered by their stamp value. Reclamation starts with the node with the lowest stamp and
can stop as soon as a node with a stamp higher than the current lowest stamp is found. No time
is wasted on nodes that cannot yet be reclaimed. Figure 1 illustrates this.
The initial value of the global stamp is zero. When thread T1 enters its critical region at time
t1 it increments the global stamp and stores the old value in its local stamp. The same happens
when T2 enters its critical region at t2 and T3 at t4. At t3, thread T1 removes the node n1 from
some data structure and marks it for reclamation. To that end, it reads the current value of the
global stamp, which is two since time instant t2, stores this value in the node and adds it to the
local retire-list. The node can be reclaimed once all threads that were in a critical region at the
time the node was marked (t3) have left their respective critical region. This can be determined
by checking if any thread in a critical region has a local stamp value that is less than the node’s
stamp. For the node n1 this would be t7 and for node n2 it would be t8.
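The example can be replayed with a straightforward model of the scheme that scans all threads to find the lowest stamp. This sketch is for illustration only: NaiveStampIt and its members are hypothetical names, the number of threads is assumed fixed, and the sentinel kNotInRegion stands for a thread that is not inside a critical region.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t P = 3;                        // threads T1..T3 (ids 0..2)
constexpr std::uint64_t kNotInRegion = UINT64_MAX;  // thread holds no stamp

struct NaiveStampIt {
    std::atomic<std::uint64_t> global_stamp{0};
    std::array<std::atomic<std::uint64_t>, P> thread_stamp;

    NaiveStampIt() { for (auto& s : thread_stamp) s.store(kNotInRegion); }

    // Enter: FAA on the global stamp, publish the returned (old) value.
    void enter(std::size_t tid) { thread_stamp[tid].store(global_stamp.fetch_add(1)); }
    void leave(std::size_t tid) { thread_stamp[tid].store(kNotInRegion); }

    // Stamp that is stored in a node when it is retired.
    std::uint64_t retire_stamp() const { return global_stamp.load(); }

    // Linear scan over all threads; this is the cost Stamp-it avoids.
    std::uint64_t lowest_stamp() const {
        std::uint64_t low = kNotInRegion;
        for (const auto& s : thread_stamp) low = std::min(low, s.load());
        return low;
    }

    // Reclaimable if no thread in a critical region has a smaller stamp.
    bool reclaimable(std::uint64_t node_stamp) const {
        return node_stamp <= lowest_stamp();
    }
};
```

Replaying the events t1 to t8 of Figure 1 on this model gives a stamp of two for n1 and three for n2; n1 becomes reclaimable once T1 and T2 have left (t7), and n2 only after T3 has left as well (t8).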
A straightforward implementation of this scheme is quite simple, but will have runtime com-
plexity linear in the number of threads since all threads have to be scanned in order to determine
the lowest stamp. To improve this, we use a data structure that supports the following operations
efficiently:
1. Add an element and assign a stamp to it (push). Stamps have to be strictly increasing, but
not necessarily consecutive.
2. Remove a specific element, return true if this element was the one with the lowest stamp
(remove).
3. Get the highest stamp ever assigned to an element.
4. Get the lowest stamp of all elements.
In addition, a global retire-list is introduced. It is used to collect nodes that could not be reclaimed
when their owning thread left its critical region. The responsibility to reclaim these nodes is deferred
to the “last” thread as explained below.
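To make the semantics of the four operations concrete, the following is a sequential reference model. It is explicitly not lock-free and not the data structure used in Stamp-it; the class StampSetModel, the choice to start stamps at one, and the value returned by lowest_stamp for an empty structure are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <set>

// Sequential model of the required operations (illustration only).
class StampSetModel {
    std::set<std::uint64_t> active_;  // stamps of the elements currently present
    std::uint64_t next_ = 1;          // next stamp to assign (assumed to start at 1)
public:
    // 1. Add an element and assign a strictly increasing stamp to it.
    std::uint64_t push() {
        active_.insert(next_);
        return next_++;
    }
    // 2. Remove an element; true if it was the one with the lowest stamp.
    bool remove(std::uint64_t stamp) {
        bool was_lowest = !active_.empty() && *active_.begin() == stamp;
        active_.erase(stamp);
        return was_lowest;
    }
    // 3. Highest stamp ever assigned (0 if none was assigned yet).
    std::uint64_t highest_stamp() const { return next_ - 1; }
    // 4. Lowest stamp of all elements; for an empty structure this model
    //    returns a value larger than every stamp assigned so far.
    std::uint64_t lowest_stamp() const {
        return active_.empty() ? next_ : *active_.begin();
    }
};
```

In this model a node retired with the stamp returned by highest_stamp can be reclaimed as soon as its stamp is less than the value returned by lowest_stamp.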
Stamp-it uses this data structure as follows. Upon entering a critical region the thread adds
itself to the data structure, and gets a new stamp value, defining a total order on the entries to the
critical regions.
When a thread retires a node, it requests the highest stamp from the data structure, stores it
in the node and appends the node to the end of its local retire-list. If this pushes the number of
entries in the local retire-list over a certain threshold it immediately performs a reclaim operation.
The reclaim operation requests the lowest stamp from the data structure and reclaims all entries
from the local retire-list with a stamp value less than the requested one.
Upon leaving a critical region the thread removes itself from the data structure and performs
a reclaim operation on the local retire-list. If the remove operation returns false and the number
of nodes in the local retire-list exceeds some threshold, the thread pushes all remaining entries to
the global retire-list as an ordered sublist. If the remove operation returns true, i.e., the thread
had the smallest stamp and was therefore “lagging behind” the most and blocking reclamation, it
will perform a reclaim operation on the global retire-list. In contrast to the local retire-list, the
global retire-list is not totally ordered and therefore does not seem to provide the same runtime
guarantees. However, since it is organized as a list of sorted sublists, each sublist needs to be
scanned only up to the node which has a stamp that is larger than or equal to the lowest stamp
returned. Therefore, if we maintain additional links from sublist to sublist, the resulting total
runtime is O(n + m) where n is the total number of reclaimable nodes and m is the number of
ordered sublists in the global retire-list.
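The O(n + m) bound can be seen from a sequential sketch of the reclaim operation on a list of sorted sublists. The container types and the names RetiredNode and reclaim_global are assumptions for this illustration; the actual global retire-list is a lock-free structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iterator>
#include <list>

struct RetiredNode { std::uint64_t stamp; };

using Sublist = std::deque<RetiredNode>;      // sorted by stamp, ascending
using GlobalRetireList = std::list<Sublist>;  // m sublists in arbitrary order

// Reclaims all nodes with a stamp smaller than `lowest` and returns their
// number. Each sublist is scanned only up to its first non-reclaimable node.
std::size_t reclaim_global(GlobalRetireList& global, std::uint64_t lowest) {
    std::size_t reclaimed = 0;
    for (auto it = global.begin(); it != global.end();) {
        while (!it->empty() && it->front().stamp < lowest) {
            it->pop_front();  // stands in for freeing the node
            ++reclaimed;
        }
        it = it->empty() ? global.erase(it) : std::next(it);
    }
    return reclaimed;
}
```

Each reclaimed node is visited once, and each sublist contributes at most one additional comparison, which gives n + m steps in total.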
We implemented the data structure as a lock-free doubly-linked list based on the proposal by
Sundell and Tsigas [31]. This data structure maintains sentinel head and tail nodes which are used
to store the highest and lowest stamp values, respectively. The push operation first increments
the head’s stamp using an atomic fetch-and-add (FAA), stores the returned value in the node it is
currently inserting and then tries to insert the node into the linked list, right after the head, using
an atomic compare-and-swap (CAS) operation. The remove operation unlinks the node from both
directions, and returns true if the node was last, i.e., the tail’s predecessor.
Every thread holds a thread-local control block that is used as a node in this list. A thread that
enters a critical region simply calls push with its node. Thus, the linked list in direction from tail
(smallest stamp) to head (largest stamp) defines the order in which the threads have entered their
respective critical regions. When a thread leaves its critical region it calls remove. If the return
value is true, it first updates tail’s stamp to match the value of the new predecessor, and then it
performs a reclaim operation on its local retire-list as well as the global retire-list. Otherwise, the
thread performs a reclaim operation on its local retire-list, and if the number of remaining nodes
exceeds some threshold, it moves the remaining local list to the global retire-list.
The algorithm is clearly lock-free. In the absence of contention, entering a critical section takes
constant time, and leaving a critical section takes time proportional to the number of reclaimable nodes.
The time per node is therefore amortized constant. In Section 4.1, we experimentally show that
even under load, the number of retry iterations is small (constant).
2.7 DEBRA
DEBRA (Distributed Epoch Based Reclamation), introduced by Brown [7], is an adaptation of
ER. The main difference is that its operations perform in only O(1) steps. This is achieved by
incrementally scanning the flags of all the other threads when entering a critical region. With each
critical region entry only one thread is checked, thus the cost of scanning n threads is amortized
over n enter operations. However, it still has to scan all threads.
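The incremental scan can be sketched as follows. The code follows only the description given above and omits all other details of the algorithm in [7]; IncrementalEpochDomain and its members are hypothetical names, and a fixed number of threads is assumed.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t P = 3;  // number of threads (assumed fixed)

struct IncrementalEpochDomain {
    struct ThreadState {
        std::atomic<bool> quiescent{true};   // true while outside a critical region
        std::atomic<unsigned> announced{0};  // epoch observed on the last entry
        std::size_t next_to_check = 0;       // private scan position
    };
    std::atomic<unsigned> epoch{0};
    std::array<ThreadState, P> threads;

    void enter(std::size_t tid) {
        ThreadState& me = threads[tid];
        unsigned e = epoch.load();
        if (me.announced.load() != e) me.next_to_check = 0;  // new epoch: restart scan
        me.announced.store(e);
        me.quiescent.store(false);
        // Check exactly one thread per entry, so each entry takes O(1) steps.
        const ThreadState& other = threads[me.next_to_check];
        if (other.quiescent.load() || other.announced.load() == e) {
            if (++me.next_to_check == P) {  // all P threads have been checked
                me.next_to_check = 0;
                epoch.compare_exchange_strong(e, e + 1);
            }
        }
    }
    void leave(std::size_t tid) { threads[tid].quiescent.store(true); }
};
```

With P threads, a single thread needs at least P region entries to advance the epoch once, and a thread that stays inside a critical region with an older epoch blocks the scan at its position.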
We included DEBRA in this section for completeness, but we have not yet implemented it, and
therefore we did not consider it in the experimental analysis. However, this is planned for future
work; the results in this paper will be updated accordingly.
3 Experimental setup
We evaluate the described memory reclamation schemes with respect to various factors. The tests
are set up similarly to those performed by Hart et al. [18] and we also repeated most of those
analyses. This section shows only a subset of the results; the remaining results can be found in
the Appendix. All results including the raw data and the scripts that were used are available on
GitHub (https://github.com/mpoeter/emr-benchmarks). This section provides details on all
aspects of our experiments.
Table 1: The four machines used in the experimental evaluation
                  AMD      Intel    XeonPhi  Sparc
Cores/CPU         12       10       61       16
SMT               -        2        4        8
Hardware Threads  48       160      244      512
Memory            128 GB   1 TB     16 GB    1 TB
CPUs
  AMD:      4x AMD Opteron(tm) Processor 6168
  Intel:    8x Intel(R) Xeon(R) CPU E7 - 8850 @ 2.00GHz
  XeonPhi:  1x Intel(R) Xeon Phi(TM) coprocessor x100 family
  Sparc:    4x SPARC-T5
OS
  AMD:      Linux 4.7.0-1-amd64 #1 SMP Debian 4.7.6-1 (2016-10-07) x86_64 GNU/Linux
  Intel:    Linux 4.7.0-1-amd64 #1 SMP Debian 4.7.6-1 (2016-10-07) x86_64 GNU/Linux
  XeonPhi:  Linux 2.6.38.8+mpss3.8.1 #1 SMP Thu Jan 12 16:10:30 EST 2017 k1om GNU/Linux
  Sparc:    SunOS 5.11 11.3 sun4v sparc sun4v
Compiler
  AMD:      gcc version 6.3.0 20170205 (Debian 6.3.0-6)
  Intel:    icpc version 17.0.1 (gcc version 6.0.0 compatibility)
  XeonPhi:  icpc version 17.0.1 (gcc version 5.1.1 compatibility)
  Sparc:    gcc version 6.3.0 (GCC)
3.1 Implementation
The tests, data structures and reclamation schemes have been implemented in C++11/14, using
an adapted version of the interface proposed by Robison [29]. This proposal introduces the concept
of a guard_ptr which allows a thread to get a safe reference to a shared node, i.e., the guard_ptr
ensures that the node cannot be reclaimed as long as the guard_ptr instance exists. Extending
this interface, we introduce the concept of a region_guard. This is used in the implementations
of NER, QSR and Stamp-it to associate critical regions with the scope of region_guard instances.
This reduces the costs of guard_ptr instances created inside the scope of a region_guard. In NER
and Stamp-it, a critical region thus spans the lifetime of any guard_ptr or region_guard instance.
Since QSR is considered to be inside a critical region at all times, each thread can go through a
quiescent state once the last guard_ptr or region_guard instance is released.
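The effect of a region guard on the cost of guard pointers can be illustrated with a small RAII sketch. The classes below are hypothetical stand-ins written for this illustration and do not reproduce the interface of [29] or of our implementation; ThreadRegionState merely counts how often the expensive region entry would be executed.

```cpp
#include <cstddef>

// Hypothetical per-thread state of a reclamation scheme.
struct ThreadRegionState {
    std::size_t nesting = 0;  // number of live guards of this thread
    std::size_t entries = 0;  // how often the expensive region entry ran
    void enter() { if (nesting++ == 0) ++entries; }  // real scheme: enter region
    void leave() { --nesting; }                      // real scheme: leave at zero
};

// Keeps the critical region open for its whole scope.
class region_guard {
    ThreadRegionState& state_;
public:
    explicit region_guard(ThreadRegionState& s) : state_(s) { state_.enter(); }
    ~region_guard() { state_.leave(); }
    region_guard(const region_guard&) = delete;
    region_guard& operator=(const region_guard&) = delete;
};

// Simplified guard pointer: the node stays protected while the guard exists.
template <class T>
class guard_ptr {
    ThreadRegionState& state_;
    T* ptr_;
public:
    guard_ptr(ThreadRegionState& s, T* p) : state_(s), ptr_(p) { state_.enter(); }
    ~guard_ptr() { state_.leave(); }
    guard_ptr(const guard_ptr&) = delete;
    guard_ptr& operator=(const guard_ptr&) = delete;
    T* get() const { return ptr_; }
};
```

Inside the scope of a region_guard, creating a guard_ptr only changes the nesting counter; without it, every guard_ptr pays for a full region entry.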
3.2 Benchmarks
We tested the reclamation schemes on (1) a queue, (2) a linked-list and (3) a hash-map. The
queue is based on Michael and Scott’s design [25], the linked-list and hash-map are based on
Michael’s improved version [21] of Harris’ list-based set [17]. The List and Queue benchmarks
have a parameter to control the number of elements initially in the data structures. For the
List benchmark the key range is calculated to be twice the initial list size. The probabilities of
inserting and removing nodes are equal, keeping the size of the list and queue data structures
roughly unchanged throughout a given run. The List benchmark has a workload parameter that
determines the fraction of updates (remove/insert) of the total number of operations. A workload of
0% corresponds to a search-only use case, while a workload of 100% corresponds to an update-only
use case.
Our experiments are throughput oriented in the following sense. The main thread spawns p
child threads and starts a timer. Every child thread performs operations on the data structure
under scrutiny until the timer expires. Upon timer expiry the child threads are stopped and the
parent thread calculates the average execution time per operation by summing up the runtime of
each child thread and its number of performed operations.
Each benchmark was performed with 30 trials, with eight seconds runtime per trial. Most of the
benchmarks focus on performance, and calculate the average runtime per single operation for each
trial. Each thread calculates its average operation runtime by dividing its active, overall runtime
by the total number of operations it performed. The total average runtime per operation is then
calculated as the average of these per-thread runtime values.
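The per-trial metric described in this paragraph can be written down as follows. The struct ThreadResult, the function name, and the choice to skip threads that performed no operations are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Measurements of one child thread in one trial (hypothetical layout).
struct ThreadResult {
    double runtime_ns;         // active, overall runtime of the thread
    std::uint64_t operations;  // number of operations it performed
};

// Average runtime per operation: the mean of the per-thread averages.
double average_runtime_per_op(const std::vector<ThreadResult>& results) {
    double sum = 0.0;
    std::size_t counted = 0;
    for (const auto& r : results) {
        if (r.operations == 0) continue;  // assumption: ignore idle threads
        sum += r.runtime_ns / static_cast<double>(r.operations);
        ++counted;
    }
    return counted ? sum / static_cast<double>(counted) : 0.0;
}
```

Note that this is the mean of the per-thread averages, so a thread that completed few operations has the same weight as one that completed many.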
It is important to note that all 30 trials were performed sequentially within the same process.
This is especially important in case of the HashMap benchmark as the hash-map is retained over
the whole runtime. This means that a result calculated in the first trial can be found in the hash-
map and reused in a subsequent trial. For this reason, performance will be worse at the beginning,
while the hash-map is in the “warm up phase”, but will improve over time when it becomes filled
and more items can be reused. But also in the other benchmarks, it is possible that previous trials
have impact on later ones, e.g., due to an already initialized memory manager. It was a deliberate
design decision to run all trials in the same process as this might more closely reflect a real world
situation.
The Queue and List benchmarks are synthetic micro-benchmarks, exactly as used by Hart et
al. [18]. The HashMap benchmark is intended to highlight other properties of the reclamation
schemes. It mimics the calculation in a complex simulation where partial results are stored in a
hash-map for later reuse. These partial results are relatively large, so in order to limit the total
memory usage the number of entries in the hash-map is kept below some threshold by evicting old
entries using a simple FIFO policy. The resulting benchmark has the following properties:
• there is no upper bound on the number of nodes that are intentionally blocked from recla-
mation.
• the average lifetime of each guard_ptr is relatively long.
• the memory footprint of each node is significant, putting additional pressure on the reclama-
tion scheme to reclaim nodes efficiently and in a timely manner.
Since there is no upper bound on the number of nodes that need to be available for a thread, the
standard HP scheme is insufficient; thus an extended version has to be used that allows a dynamic
number of hazard pointers as explained by Michael [22]. The number of buckets in the hash-map is
2048, the maximum number of entries in the hash-map is 10000. There are 30000 possible partial
results and every thread has to calculate or reuse 1000 partial results per “simulation”. The size
of a partial result is 1024 bytes.
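The eviction policy of the HashMap benchmark can be illustrated with a small single-threaded sketch. The benchmark itself uses a concurrent hash-map; the class `fifo_result_cache` below is hypothetical and only shows how a FIFO policy keeps the number of entries below a bound.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <utility>
#include <vector>

// Single-threaded sketch of a result cache with FIFO eviction.
class fifo_result_cache {
public:
  using result = std::vector<unsigned char>;  // e.g. 1024 bytes per partial result

  explicit fifo_result_cache(std::size_t max_entries) : max_entries_(max_entries) {}

  // Returns the cached result for `key`, or nullptr if it is not present.
  const result* find(int key) const {
    auto it = entries_.find(key);
    return it == entries_.end() ? nullptr : &it->second;
  }

  // Stores a freshly calculated result and evicts the oldest entries
  // until the cache is within its bound again.
  void insert(int key, result value) {
    if (entries_.emplace(key, std::move(value)).second) {
      order_.push_back(key);
    }
    while (entries_.size() > max_entries_) {
      entries_.erase(order_.front());
      order_.pop_front();
    }
  }

  std::size_t size() const { return entries_.size(); }

private:
  std::size_t max_entries_;
  std::unordered_map<int, result> entries_;
  std::deque<int> order_;  // insertion order, oldest key at the front
};
```

In the concurrent setting every evicted entry becomes a retired node, which is why this workload puts pressure on the reclamation scheme.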
Last but not least, the GuardPtr benchmark is used to measure the base cost/overhead of
creating and destroying guard_ptr instances. Each thread repeatedly creates a guard_ptr on a
shared node and immediately destroys it; no other operations are performed.
3.3 Environment
We executed our tests on four machines with different (micro)architectures. Their respective spec-
ifications are shown in Table 1. These machines all have a relatively large number of cores and
hardware supported threads, allowing us to run our experiments at a scale not found in most prior
studies. We did not experiment with oversubscribed cores.
On Sparc we used jemalloc [12] since in Solaris the libc implementation of malloc and free
uses a global lock. We did not use libumem (a scalable memory manager that is part of all newer
Solaris versions, see https://blogs.oracle.com/ahl/number-11-of-20:-libumem), because we ran
into some sporadic but severe performance drops when running
with a very large number of threads (> 200). We suspect these issues to be caused by large numbers
of cross-thread deallocations. As alternatives we tried Hoard [5] and jemalloc, but Hoard showed
similar symptoms as libumem while jemalloc did not.
Figure 2: Mean number of iterations for the respective operations in the GuardPtr benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean iterations; series: push, remove_prev, remove_next.)
Figure 3: Mean number of iterations for the respective operations in the Queue benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean iterations; series: push, remove_prev, remove_next.)
ER/NER try to advance the epoch once every 100 critical region entries. In the List and Queue
benchmarks, a region_guard spans 100 benchmark operations, so this is the size of the critical
region for QSR, NER and Stamp-it. QSR executes the fuzzy barrier when it exits the critical
region. In HPR the local retire list is scanned once its threshold is exceeded; the threshold is
$100 + \sum_{i=0}^{p} K_i \cdot 2$, where $p$ is the number of threads and $K_i$ is the number of
hazard pointers for the thread with index $i$.
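As a worked example of the scan threshold, the following sketch reads the formula as 100 plus twice the total number of hazard pointers. Both this reading of the operator precedence and the function name `hp_scan_threshold` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Scan threshold for the local retire list, read as
// 100 + 2 * (K_0 + K_1 + ...), where K_i is the number of hazard pointers
// of the thread with index i. The precedence is an assumption of this sketch.
inline std::size_t hp_scan_threshold(
    const std::vector<std::size_t>& hazard_pointers_per_thread) {
  const std::size_t total =
      std::accumulate(hazard_pointers_per_thread.begin(),
                      hazard_pointers_per_thread.end(), std::size_t{0});
  return 100 + 2 * total;
}
```

With four threads holding three hazard pointers each, this gives 100 + 2 * 12 = 124 retired nodes before a scan is triggered. The threshold grows linearly with the number of threads.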
4 Experimental Results
In this section we present the results of a subset of our experiments. First, we show that Stamp-
it meets the expectations with respect to the expected average runtime complexity. We then
present thread scalability results, and finally investigate the reclamation efficiency for all described
schemes. All results including the data are available on GitHub
(https://github.com/mpoeter/emr-benchmarks).
Figure 4: Mean number of iterations for the respective operations in the HashMap benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean iterations; series: push, remove_prev, remove_next.)
4.1 Stamp-it base performance
We first give an experimental analysis of the effective average number of steps for the data structure
operations in Stamp-it. To that end, we use thread-local performance counters to keep track of the
number of retries due to failed CAS operations in push and remove, which allows us to calculate the
average number of iterations per operation. Since the data structure is based on a doubly linked list,
the remove operation builds on two other operations, remove_from_prev and remove_from_next, to
remove the node from both directions [31].
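The counting of retries can be sketched as follows. To keep the example short, a simple lock-free stack stands in for Stamp-it's doubly linked list; the names `op_counters`, `counted_push` and `mean_iterations` are hypothetical and not taken from the implementation.

```cpp
#include <atomic>
#include <cstdint>

struct node {
  int value;
  node* next;
};

// Thread-local counters (C++17 inline variable); a benchmark harness could
// sum them over all threads after a trial.
struct op_counters {
  std::uint64_t operations = 0;
  std::uint64_t iterations = 0;  // failed CAS attempts, i.e. retries
};
inline thread_local op_counters push_counters;

// Push onto a lock-free stack, counting every failed CAS as one retry.
inline void counted_push(std::atomic<node*>& head, node* n) {
  ++push_counters.operations;
  n->next = head.load(std::memory_order_relaxed);
  // On failure, compare_exchange_strong stores the current head in n->next.
  while (!head.compare_exchange_strong(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    ++push_counters.iterations;  // another thread changed head; retry
  }
}

// Average number of retries per operation for one thread.
inline double mean_iterations(const op_counters& c) {
  return c.operations == 0
             ? 0.0
             : static_cast<double>(c.iterations) /
                   static_cast<double>(c.operations);
}
```

Because the counters are thread-local, the bookkeeping itself adds no contention to the operation being measured.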
The benchmarks were run as described in Section 3, but instead of average time per operation,
the average number of iterations in push, remove_from_prev and remove_from_next has been
measured. The results for the various benchmarks are shown in Figures 2, 3 and 4. The results
for the List benchmark are omitted because they are qualitatively similar to those from the Queue
benchmark; they can be found in Appendix A.2.
The GuardPtr benchmark is the most interesting, since this is kind of a “stress test”, i.e., it
simulates the worst case scenario where all threads just insert and immediately remove themselves
from the data structure. Essentially, this scenario tests the scalability of the data structure itself.
As can be seen in Figure 2, the average number of iterations is less than the number of threads in
all cases, suggesting that even in this worst case scenario the expected average runtime complexity
is O(p).
Interestingly, the behavior differs significantly on the various architectures. For AMD, Intel
and the XeonPhi the results are dominated by the number of iterations in remove_from_prev. On
XeonPhi the number of iterations in push increases significantly once the number of threads is
greater than 120. The reason for this could lie in the SMT based architecture with 61 physical
cores and the way instructions are scheduled [27]. On Sparc the situation is the opposite:
The number of threads has almost no impact on the number of iterations in the remove methods;
instead the number of iterations in push increases, but quite unsteadily.
It is likewise interesting to see how the data structure performs under “normal” conditions.
As can be seen in Figures 3 and 4, which give results from the Queue and HashMap benchmarks,
the number of threads has almost no measurable impact on the number of iterations for all three
methods: Numbers are more or less constant, with a few outliers in the HashMap benchmark where
we can see a small increase around 4-16 threads that decreases again with a growing number of
threads.
Figure 5: Base costs of the five plus one schemes in single thread runs. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: benchmark (GuardPtr, Queue, List reads, List writes); y-axis: mean µs/op; series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
4.2 Base costs
In this analysis we measure the base costs of the schemes. We use a single thread to eliminate
contention on the used data structures, so resulting performance differences are caused solely by
creating and destroying guard_ptr instances. We also include the GuardPtr benchmark to measure
the pure overhead of creating and releasing guard_ptr instances without any other operations
involved.
All benchmarks except HashMap were run on all machines using a single thread, 30 trials
and eight seconds runtime. The results are shown in Figure 5; “List reads” corresponds to the
List benchmark with a workload of 0% (i.e., read-only) and “List writes” corresponds to the List
benchmark with a workload of 100% (i.e., all operations are either insert or delete). The number
of elements for the List and Queue benchmarks was 10. The HashMap benchmark was excluded
here because it has a very high runtime dominated by the simulated calculations; the overhead for
allocating and releasing guard_ptr instances is largely irrelevant.
Stamp-it performs very poorly in the GuardPtr benchmark due to the more expensive operations
to insert and remove the thread from the internal data structure. But the results show that there is
hardly any trace of this overhead in the other benchmarks; in some cases Stamp-it is the fastest of all
schemes. This is due to the fact that, just like NER and QSR, Stamp-it also uses the region_guard
concept to amortize the cost of these insert and remove calls over a larger number of operations.
We again observe significant differences between Sparc and the Intel based architectures. On Sparc
LFRC is significantly slower than HP.
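The amortization effect of a region guard can be illustrated with a small sketch. The class below is a hypothetical stand-in, not the interface of the actual implementation: it only counts how often the expensive enter and leave steps would run.

```cpp
#include <cstdint>

// Hypothetical stand-in for a reclamation scheme's per-thread state.
struct thread_state {
  std::uint32_t nesting = 0;        // depth of nested critical regions
  std::uint64_t expensive_ops = 0;  // how often the costly enter/leave ran
};
inline thread_local thread_state local_state;  // C++17 inline variable

// RAII guard: only the outermost guard pays for entering and leaving the
// critical region; nested guards just adjust a counter.
class region_guard {
public:
  region_guard() {
    if (local_state.nesting++ == 0) {
      ++local_state.expensive_ops;  // e.g. insert the thread into the list
    }
  }
  ~region_guard() {
    if (--local_state.nesting == 0) {
      ++local_state.expensive_ops;  // e.g. remove the thread again
    }
  }
  region_guard(const region_guard&) = delete;
  region_guard& operator=(const region_guard&) = delete;
};

// One benchmark operation; it enters a critical region unless the thread
// is already inside one.
inline void benchmark_operation() {
  region_guard inner;
}

// Runs `n` operations inside a single enclosing region.
inline void run_in_region(int n) {
  region_guard outer;
  for (int i = 0; i < n; ++i) benchmark_operation();
}
```

Running 100 operations without an enclosing region costs 200 expensive steps in this model, while `run_in_region(100)` costs only two.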
4.3 Scalability with threads
We now study the effect of increasing the number of threads that share a single instance of some
data structure.
Figure 6 shows the performance of the reclamation schemes in the Queue benchmark. Surpris-
ingly, LFRC performs by far the best on Sparc and on XeonPhi, but is by far the worst on Intel.
On AMD, HP has a huge performance drop when running with the maximum number of threads.
A similar effect can be seen for the other schemes as well, but it is much less pronounced. Apart from
these exceptions, all schemes seem to scale largely equally well in this scenario.
Figure 6: Performance of the Queue benchmark with varying number of threads. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean µs/op; series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
Figure 7: Performance of the List benchmark with 10 elements, a workload of 20% and varying number of threads (without LFRC). (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean µs/op; series: HPBR, EBR, NEBR, QSBR, stamp.)
For the results of the List benchmark in Figure 7, LFRC has been excluded since it performs
exceedingly poorly in this scenario, especially with a larger number of threads. On AMD, Intel
and XeonPhi, all schemes are more or less on par, but on Sparc ER and NER show a significant
degradation when the number of threads grows beyond 128. What is surprising, though, is that in
all those cases NER performs consistently worse than ER. This is quite unexpected, since NER was
designed to have less overhead than ER. We did not investigate the reasons for this in more detail,
but one assumption is that this might be caused by a larger number of unsuccessful attempts to
update the global epoch, which could be caused by NER’s dependence on larger critical regions.
Finally, the results for the HashMap benchmark are shown in Figure 8. QSR has been excluded
because it scales very poorly on all architectures in this update-heavy scenario. On AMD, ER,
NER and Stamp-it scale almost perfectly, while LFRC’s and HP’s performance starts to degrade
once the number of threads grows beyond 16. On Intel, LFRC scales very poorly while all other
schemes scale more or less equally well, but not as well as on AMD. On XeonPhi, on the other
hand, LFRC scales best, while HP’s performance starts degrading with more than 16 threads, but
it again improves with more than 128 threads. The other schemes continuously lose performance
when the number of threads grows from 16 to ∼80, but then stay more or less the same.
Figure 8: Performance of the HashMap benchmark with varying number of threads. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean ms/op; series: LFRC, HPBR, EBR, NEBR, stamp.)
Figure 9: Number of unreclaimed nodes over time in the HashMap benchmark on Sparc. The x-axis is the current sample. (Plot omitted; Sparc with 512 threads; panels: trial 1 to trial 5; x-axis: collected sample; y-axis: unreclaimed nodes; series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
The biggest surprise is the result on Sparc. Here, the performance of HP, ER and NER degrades
dramatically, while LFRC and Stamp-it scale almost perfectly. With 512 threads the performance
difference between LFRC/Stamp-it and the other schemes is a factor of ∼4000. The reason for
this will become clear when we look at the results of the reclamation efficiency analysis in the next
section.
4.4 Reclamation efficiency
This analysis focuses on how efficiently (fast) the various schemes actually reclaim retired nodes.
An increased reclamation efficiency can drastically reduce memory pressure, which in turn can have
a significant impact on the overall performance. Nonetheless, this aspect is usually disregarded in
analyses of concurrent reclamation schemes.
Figure 10: Development of runtime over time in the HashMap benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: trial; y-axis: mean ms/op; series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
To measure reclamation efficiency we use thread-local performance counters that track the
number of allocated and reclaimed nodes. By calculating the difference, we get the number of
unreclaimed nodes, which is our measure of efficiency; a smaller number of unreclaimed
nodes means that the reclamation scheme works more efficiently.
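A single sample of this metric could be computed as in the following sketch. The struct `reclamation_counters` and the function `unreclaimed_nodes` are illustrative names, and the use of relaxed atomic counters is an assumption about how a sampling thread might read values that other threads are still updating.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Per-thread counters; each pair is written by its owning thread only and
// read by the sampling thread.
struct reclamation_counters {
  std::atomic<std::uint64_t> allocated{0};
  std::atomic<std::uint64_t> reclaimed{0};
};

// One sample: total allocated minus total reclaimed over all threads.
// The counters are read while threads keep running, so a sample is only an
// approximation; a signed result type tolerates transient inconsistencies.
inline std::int64_t unreclaimed_nodes(
    const std::vector<reclamation_counters>& per_thread) {
  std::int64_t allocated = 0;
  std::int64_t reclaimed = 0;
  for (const auto& c : per_thread) {
    allocated += static_cast<std::int64_t>(
        c.allocated.load(std::memory_order_relaxed));
    reclaimed += static_cast<std::int64_t>(
        c.reclaimed.load(std::memory_order_relaxed));
  }
  return allocated - reclaimed;
}
```

A node may be allocated by one thread and reclaimed by another, which is why only the difference of the totals is meaningful, not the difference per thread.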
The plots in this analysis show the development of the number of unreclaimed nodes over time.
Each configuration is run with five trials, each with a runtime of eight seconds. During each trial
a total of 50 samples are collected. Since the benchmarks are randomized each configuration with
the five trials is run 20 times to account for any fluctuation in the measured samples. The plots
show the smoothed conditional means of the measured samples of those 20 runs over the number
of samples recorded during each run.
For reclamation efficiency, reference counting is the “gold standard”. In contrast to all other
schemes there is no delay: A node is reclaimed immediately when the last thread drops its reference
to that node. So in all the plots, LFRC can be seen as the baseline against which all other schemes
have to be measured. One has to keep in mind, though, that LFRC is not a general reclamation
scheme, since the reclaimed nodes are not returned to the memory manager, but stored in the
internal free-list.
Figure 9 shows the results for the HashMap benchmark on Sparc. The results for the other
benchmarks and machines can be found in Appendix A.6.
What can be seen is that the number of unreclaimed nodes for HP, ER, NER and QSR is
constantly increasing. It does not even go down at the end of the trials when all threads are
stopped.
When a thread terminates, all schemes add the remaining nodes to a global list. But who is
responsible to reclaim them, and when? In Stamp-it the responsibility is transferred to the “last”
thread. Other schemes do not have a notion of a “last” thread, so the global retire-list is checked
by each thread when it performs reclamation on its local retire-list. When a thread tries to reclaim
nodes from the global list it steals the whole list, reclaims all reclaimable nodes and then re-adds
the remaining nodes to the global list. This leads to a race during the end of a trial; whoever steals
the list might not be able to reclaim all nodes yet, but when the remaining nodes are re-added
to the global list, there might be no threads left. Stamp-it mitigates this race as it is cheap to
check whether the global stamp has changed since reclamation has started, and so it can restart
reclamation with the new stamp value. Obviously, the effects of this race are more pronounced the
more threads are involved. The behavior shown in Figure 33 in Appendix A.6 is a direct result
of this race.
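The steal, reclaim and re-add steps on the global retire-list can be sketched as below. This is a simplified illustration under stated assumptions: `retired_node`, its `stamp` field, and the `is_reclaimable` and `reclaim` callbacks are hypothetical placeholders for scheme-specific logic.

```cpp
#include <atomic>

struct retired_node {
  int stamp;  // illustrative value that decides reclaimability
  retired_node* next = nullptr;
};

// Steal the whole global list, reclaim what is safe, and re-add the rest.
template <class Pred, class Reclaim>
void reclaim_global_list(std::atomic<retired_node*>& global_list,
                         Pred is_reclaimable, Reclaim reclaim) {
  // 1. Steal: afterwards no other thread can see these nodes.
  retired_node* cur = global_list.exchange(nullptr, std::memory_order_acquire);

  // 2. Reclaim what we can and collect the rest in a private list.
  retired_node* remaining_head = nullptr;
  retired_node* remaining_tail = nullptr;
  while (cur != nullptr) {
    retired_node* next = cur->next;
    if (is_reclaimable(*cur)) {
      reclaim(cur);
    } else {
      cur->next = remaining_head;
      if (remaining_head == nullptr) remaining_tail = cur;
      remaining_head = cur;
    }
    cur = next;
  }

  // 3. Re-add the remaining nodes. If no other thread is left to check the
  //    global list afterwards, these nodes stay unreclaimed.
  if (remaining_head != nullptr) {
    remaining_tail->next = global_list.load(std::memory_order_relaxed);
    while (!global_list.compare_exchange_weak(remaining_tail->next,
                                              remaining_head,
                                              std::memory_order_release,
                                              std::memory_order_relaxed)) {
      // remaining_tail->next now holds the current head; retry.
    }
  }
}
```

Step 3 is where the race described above arises: the thread that re-adds the nodes may be the last one running, so nobody revisits the list.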
The failure to efficiently reclaim nodes increases memory pressure, which has a direct impact
on the runtime. Figure 10 shows the development of the runtime over the five trials. On Sparc
we can see that the runtime of HP, ER, NER and QSR is increasing with each trial, while that of LFRC
and Stamp-it is decreasing. On the other architectures runtime is decreasing for all schemes except
QSR. A decreasing runtime is the expected behavior, since more results can be reused once the
hash-map has been filled.
HP also performs very poorly on the other architectures when the number of threads becomes
very large. This is caused by the larger threshold for the number of retired nodes, which is needed
to achieve amortized constant processing time.
5 Conclusion and Future Work
This paper introduced Stamp-it, a new, general purpose memory reclamation scheme with attractive
features. To the best of our knowledge, this is the first non-reference counting based scheme that
does not have to scan all other threads to determine reclaimability of a node.
We have also presented a large scale experimental study, comparing the performance of five plus
one reclamation schemes on four different architectures in various scenarios. Our empirical results
show that Stamp-it matches or outperforms the other analyzed reclamation schemes in almost all
cases.
All of the analyzed schemes are implemented in portable, standard-conforming C++, based on
the standardized interface proposed by Robison [29]; the full source code is available on GitHub
(https://github.com/mpoeter/emr).
For future work we plan to add an implementation of DEBRA [7] and include it in the benchmark
results. It might be interesting to look for other data structures that could replace the doubly linked
list, i.e., data structures that have less overhead while providing all the required properties. In this
context we might also try to relax some of these properties (e.g., use a partial order instead of a
strict order for thread entries) in order to reduce contention on the data structure.
References
[1] Z. Aghazadeh, W. Golab, and P. Woelfel. Making objects writable. In Proceedings of the 2014
ACM Symposium on Principles of Distributed Computing (PODC), pages 385–395. ACM,
2014.
[2] D. Alistarh, P. Eugster, M. Herlihy, A. Matveev, and N. Shavit. Stacktrack: An automated
transactional approach to concurrent memory reclamation. In Proceedings of the 9th ACM
European Conference on Computer Systems (EuroSys), pages 25:1–25:14. ACM, 2014.
[3] D. Alistarh, W. M. Leiserson, A. Matveev, and N. Shavit. Threadscan: Automatic and scalable
memory reclamation. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA), pages 123–132. ACM, 2015.
[4] O. Balmau, R. Guerraoui, M. Herlihy, and I. Zablotchi. Fast and robust memory reclamation
for concurrent data structures. In Proceedings of the 28th ACM Symposium on Parallelism in
Algorithms and Architectures (SPAA), pages 349–359. ACM, 2016.
[5] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory
allocator for multithreaded applications. SIGOPS Oper. Syst. Rev., 34(5):117–128, 2000.
[6] A. Braginsky, A. Kogan, and E. Petrank. Drop the anchor: Lightweight memory management
for non-blocking data structures. In Proceedings of the 25th Annual ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA), pages 33–42. ACM, 2013.
[7] T. A. Brown. Reclaiming memory for lock-free data structures: There has to be a better way.
In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC),
pages 261–270. ACM, 2015.
[8] N. Cohen and E. Petrank. Automatic memory reclamation for lock-free data structures. In
Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Pro-
gramming, Systems, Languages, and Applications (OOPSLA), pages 260–279. ACM, 2015.
[9] N. Cohen and E. Petrank. Efficient memory management for lock-free data structures with
optimistic access. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA), pages 254–263. ACM, 2015.
[10] M. Desnoyers, P. E. McKenney, A. S. Stern, M. R. Dagenais, and J. Walpole. User-level
implementations of read-copy update. IEEE Transactions on Parallel and Distributed Systems,
23(2):375–382, 2012.
[11] D. L. Detlefs, P. A. Martin, M. Moir, and G. L. Steele Jr. Lock-free reference counting. In
Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC),
pages 190–199, 2001.
[12] J. Evans. A scalable concurrent malloc(3) implementation for FreeBSD, 2006.
[13] K. Fraser. Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory,
2004.
[14] S. Ghemawat and P. Menage. TCMalloc: Thread-caching malloc.
http://goog-perftools.sourceforge.net/doc/tcmalloc.html, 2011.
[15] A. Gidenstam, M. Papatriantafilou, H. Sundell, and P. Tsigas. Efficient and reliable lock-
free memory reclamation based on reference counting. IEEE Transactions on Parallel and
Distributed Systems, 20(8):1173–1187, 2009.
[16] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors.
In Proceedings of the 3rd International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), pages 54–63. ACM, 1989.
[17] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proceedings of the
15th International Conference on Distributed Computing (DISC), pages 300–314. Springer-
Verlag, 2001.
[18] T. E. Hart, P. E. McKenney, A. D. Brown, and J. Walpole. Performance of memory reclamation
for lockless synchronization. J. Parallel Distrib. Comput., 67(12):1270–1285, 2007.
[19] M. Herlihy, V. Luchangco, and M. Moir. The repeat offender problem: A mechanism for
supporting dynamic-sized, lock-free data structures. In Proceedings of the 16th International
Conference on Distributed Computing (DISC), pages 339–353. Springer-Verlag, 2002.
[20] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve
concurrency problems. In Proceedings of the 1998 International Conference on Parallel and
Distributed Computing and Systems, 1998.
[21] M. M. Michael. High performance dynamic lock-free hash tables and list-based sets. In
Proceedings of the 14th Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA), pages 73–82. ACM, 2002.
[22] M. M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans-
actions on Parallel and Distributed Systems, 15(6):491–504, 2004.
[23] M. M. Michael. Scalable lock-free dynamic memory allocation. SIGPLAN Not., 39(6):35–46,
2004.
[24] M. M. Michael and M. L. Scott. Correction of a memory management method for lock-free
data structures. Technical report, University of Rochester, 1995.
[25] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking con-
current queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles
of Distributed Computing (PODC), pages 267–275. ACM, 1996.
[26] M. M. Michael and M. Wong. Hazard pointers - safe resource reclamation for optimistic
concurrency. C++ standards committee paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0233r0.pdf, 2016.
[27] R. Rahman. Intel® Xeon Phi™ core micro-architecture.
https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture, 2013.
[28] P. Ramalhete and A. Correia. Brief announcement: Hazard eras - non-blocking memory
reclamation. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA), pages 367–369, 2017.
[29] A. D. Robison. Policy-based design for safe destruction in concurrent containers. C++
standards committee paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3712.pdf, 2013.
[30] H. Sundell. Wait-free reference counting and memory management. In Proceedings of the 19th
IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 24b–24b,
2005.
[31] H. Sundell and P. Tsigas. Lock-free deques and doubly linked lists. J. Parallel Distrib. Comput.,
68(7):1008–1020, 2008.
[32] J. D. Valois. Lock-Free Data Structures. PhD thesis, Rensselaer Polytechnic Institute, 1995.
A Additional results
This appendix contains additional results, some of which were briefly discussed but not shown in
the main text.
A.1 Lock-free Reference counting
Reference counting is prone to false sharing because the reference counter is part of the node.
This can usually be avoided by extra padding, however at the cost of a higher memory overhead.
Another reason for the high overhead of LFRC is the global free-list that is shared by all threads
and can lead to high contention. A simple way to reduce this contention is to use fixed-size thread-
local free-lists as buffers. Both improvements, padding as well as the thread-local free-lists, have
been implemented here.
LFRC is often criticized for its bad performance. As already mentioned, we tried to improve
that by implementing the following two extensions: a) padding to avoid false sharing between the
reference counter and other node members, and b) bounded, thread-local free-lists. The following
variations of LFRC were used in the experiment:
• unpadded – LFRC without padding.
• unpadded-20 – like unpadded, but with local free-lists of 20 entries.
• padded – LFRC with padding
• padded-20 – like padded, but with local free-lists of 20 entries.
The results are shown in Figures 11, 12, 13 and 14. The results are quite interesting as there
is no overall “best” configuration. Instead the performance of the different configurations varies
with both the benchmark data structure and the CPU architecture. However, in almost all cases
at least one of the other configurations is significantly faster than the original, unpadded LFRC.
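The two extensions can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the node layouts, the assumed cache line size of 64 bytes and the class `local_free_list` are all hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t cache_line_size = 64;  // assumption; differs between CPUs

// Unpadded: the counter shares a cache line with the payload, so updates of
// the reference count may invalidate the line for readers of value/next.
struct unpadded_node {
  std::atomic<std::uint32_t> ref_count{0};
  int value = 0;
  unpadded_node* next = nullptr;
};

// Padded: the counter gets a cache line of its own, at the cost of memory.
struct padded_node {
  alignas(cache_line_size) std::atomic<std::uint32_t> ref_count{0};
  alignas(cache_line_size) int value = 0;
  padded_node* next = nullptr;
};

static_assert(sizeof(padded_node) >= 2 * cache_line_size,
              "counter and payload occupy separate cache lines");

// Bounded thread-local buffer in front of a shared, global free-list.
template <class Node, std::size_t Capacity = 20>
class local_free_list {
public:
  // Returns false if the buffer is full; the caller then uses the global list.
  bool push(Node* n) {
    if (size_ == Capacity) return false;
    slots_[size_++] = n;
    return true;
  }
  // Returns nullptr if the buffer is empty.
  Node* pop() { return size_ == 0 ? nullptr : slots_[--size_]; }
  std::size_t size() const { return size_; }

private:
  std::array<Node*, Capacity> slots_{};
  std::size_t size_ = 0;
};
```

The fixed cache line size is a simplification; C++17 also offers `std::hardware_destructive_interference_size` in `<new>`, although not every standard library provides it.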
Figure 11: Performance of different LFRC configurations with varying number of threads in the Queue benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean µs/op; series: unpadded, unpadded-20, padded, padded-20.)
Figure 12: Performance of different LFRC configurations with varying number of threads in the List benchmark with a workload of 20%. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean µs/op; series: unpadded, unpadded-20, padded, padded-20.)
Figure 13: Performance of different LFRC configurations with varying number of threads in the List benchmark with a workload of 80%. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean µs/op; series: unpadded, unpadded-20, padded, padded-20.)
A.2 Stamp-it List benchmark
Figures 15 and 16 show the Stamp-it results from the List benchmark with a workload of 20% and
80%. These results were excluded from the main text since they are qualitatively similar to the
results from the Queue benchmark as shown in Figure 3.
A.3 Scalability with workload
This analysis, which was omitted from the main text, uses the List benchmark to examine the
workload’s impact on the reclamation schemes by gradually increasing the share of update
operations from read-only to update-only.
Figure 14: Performance of different LFRC configurations with varying number of threads in the HashMap benchmark. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean ms/op; series: unpadded, unpadded-20, padded, padded-20.)
Figure 15: Mean number of iterations for the respective operations in the List benchmark with workload 20%. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean iterations; series: push, remove_prev, remove_next.)
When purely read-only operations are used no nodes get reclaimed, so the schemes only differ
in the performance overhead of acquiring and releasing the necessary guard_ptr instances. With
an increasing number of update operations, the performance overhead for acquiring and releasing
the guard_ptr instances stays the same (we still have to search the list the same way as for read-
only operations). But the more update operations are performed (specifically delete operations),
the more impact on the overall performance is caused by the reclamation of retired nodes. The
benchmark was run in four different configurations:
• one thread; one element (see Figure 17)
• one thread; 25 elements (see Figure 18)
• 32 threads; one element (see Figures 19 and 21)
• 32 threads; 25 elements (see Figures 20 and 22)
Figure 16: Mean number of iterations for the respective operations in the List benchmark with workload 80%. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: threads; y-axis: mean iterations; series: push, remove_prev, remove_next.)
Figure 17: Effect of varying workload on a lock-free list with one element, one thread. (Plot omitted; panels: AMD, Sparc, Intel, XeonPhi; x-axis: workload; y-axis: mean µs/op; series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
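How a workload value translates into a mix of operations can be sketched as follows. The function `pick_operation` is hypothetical, and splitting the updates evenly between insert and remove is an assumption of this sketch, not something stated for the benchmark.

```cpp
#include <random>

enum class list_op { search, insert, remove };

// Picks the next operation for a given workload, i.e. the fraction of
// update operations (0.0 = read-only, 1.0 = update-only). Updates are split
// evenly between insert and remove (an assumption).
template <class Rng>
list_op pick_operation(double workload, Rng& rng) {
  if (!std::bernoulli_distribution(workload)(rng)) {
    return list_op::search;
  }
  return std::bernoulli_distribution(0.5)(rng) ? list_op::insert
                                               : list_op::remove;
}
```

Each benchmark thread would own its random number generator, for example a `std::mt19937` seeded per thread, so that choosing the operation does not introduce shared state.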
Each configuration was run with 30 trials and a runtime of eight seconds. For LFRC the con-
figuration with padding and a local free-list with 20 entries was used; based on the results from
Section A.1 it seemed to be the overall best choice for this scenario.
As can be seen from the results of the various configurations, the workload by itself seems to have
no significant impact on the performance of the reclamation schemes; within each configuration and
architecture, all schemes exhibit roughly the same slope, i.e., the relative performance difference
between the schemes stays more or less the same, regardless of the workload. Hart et al. came to
the same conclusion in their experiments [18]. This is not entirely unexpected, since insert and
remove operations still require the same lookup to be performed as in a search operation. The only
exception is LFRC, which actually shows a performance improvement on Sparc in the configuration
with one element and 32 threads (see Figure 19), but it starts out with a huge gap to the other
schemes. It is not entirely clear why LFRC can improve its performance, but we suspect it is due
to the way of how LFRC reuses reclaimed nodes.
In the base cost analysis we saw that LFRC seems to incur a higher overhead on the Sparc
architecture. Figure 17 shows the results for the configuration with one element and one thread. In
this configuration HP performs worst in almost all cases, while LFRC on the other hand is almost
always fastest, or at least on par with the fastest scheme, with the exception of Sparc, where LFRC
performs worse than HP in virtually all scenarios. This pattern also emerges from the results of
all other configurations, which corroborates the observation from the base cost analysis that LFRC
performs worse on Sparc, and is thus less well suited for this architecture.
Figure 18: Effect of varying workload on a lock-free list with 25 elements, one thread.
Figure 19: Effect of varying workload on a lock-free list with one element, 32 threads.
Figure 20: Effect of varying workload on a lock-free list with 25 elements, 32 threads.
Figure 21: Effect of varying workload on a lock-free list with one element, 32 threads without LFRC.
Figure 22: Effect of varying workload on a lock-free list with 25 elements, 32 threads without LFRC.
(Figures 18 to 22 each have four panels: AMD, Sparc, Intel, XeonPhi. x-axis: workload; y-axis: mean µs/op. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp, with LFRC omitted in Figures 21 and 22.)
Naturally, LFRC performs significantly worse with a growing number of threads, as can be seen
in Figures 19 and 20. What is quite interesting, though, is that in the scenario with a single element
(see Figure 19), on Sparc the performance of LFRC actually improves with a higher workload;
the other schemes and architectures do not show such an effect. Since these results are dominated
by the rather bad performance of LFRC, Figures 21 and 22 show the same results with LFRC
excluded.
What can be seen in Figure 21 is that in the configuration with 32 threads and a single element,
Stamp-it performs significantly worse than the other schemes on Sparc in the first scenarios, which
have a low workload. But with an increased workload this performance difference completely
vanishes. The reason for this is the higher overhead in Stamp-it’s enter and leave functions. With
only a single element and a low workload, this overhead dominates the total work each thread is
handling. By increasing the workload, this overhead becomes much less relevant, while at the same
time efficient reclamation of the removed elements becomes more important. So in the scenarios
with higher workload Stamp-it shows much better performance. Obviously, an increased number
of elements also reduces the relevance of this overhead. The configuration with 32 threads and
25 elements even shows inverted results (see Figure 22); in this configuration Stamp-it clearly
outperforms all the other schemes on Sparc. Interestingly, the other architectures are largely
unaffected and show no such bias.
A.4 Scalability with traversal length
The number of elements in a list can also have an impact on how well the different reclamation
schemes perform. This analysis examines this impact by varying the number of elements the list
gets initialized with at the start of each trial from zero to 100. It is also run in four different
configurations, each with 30 trials and a runtime of eight seconds:
• one thread; workload of zero (i.e., read-only) (see Figure 23)
• one thread; workload of 50% (see Figure 24)
• 32 threads; workload of zero (i.e., read-only) (see Figures 25 and 27)
• 32 threads; workload of 50% (see Figures 26 and 28)
Results can be seen in Figures 23, 24, 25, and 26.
The single threaded results for a read-only list (see Figure 23) and a workload of 50% (see
Figure 24) look almost identical. This corroborates the observations from the previous analysis
that the workload has no significant impact on the performance of the reclamation schemes.
What can be seen from the results of the single thread configurations is that with an increasing
traversal length the performance of LFRC and HP degrades. This is expected since, due to their
design, these schemes have a per-element overhead. It is interesting though, that this effect varies
in intensity depending on the respective architecture.
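The per-element overhead mentioned above can be illustrated with a small cost model. This is only a sketch under simplifying assumptions: it counts bookkeeping steps instead of performing real reference counting or hazard pointer publication, and the names `Cost`, `traverse_per_element` and `traverse_in_region` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int value = 0;
    Node* next = nullptr;
};

// Bookkeeping steps performed during one traversal.
struct Cost {
    std::size_t visited = 0;        // nodes traversed
    std::size_t protect_calls = 0;  // per-node protection steps
    std::size_t region_enters = 0;  // critical-region entries
};

// Chain the nodes of a vector into a singly linked list.
inline void link_nodes(std::vector<Node>& nodes) {
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i].next = &nodes[i + 1];
}

// Per-element model (in the spirit of LFRC and hazard pointers): every
// visited node costs one protection step (counter update or pointer publish).
inline Cost traverse_per_element(const Node* head) {
    Cost c;
    for (const Node* n = head; n != nullptr; n = n->next) {
        ++c.protect_calls;
        ++c.visited;
    }
    return c;
}

// Region model: the critical region is entered once per operation and the
// traversal itself consists of plain pointer hops.
inline Cost traverse_in_region(const Node* head) {
    Cost c;
    ++c.region_enters;
    for (const Node* n = head; n != nullptr; n = n->next)
        ++c.visited;
    return c;
}
```

For a list of 100 nodes the per-element model performs 100 protection steps per traversal, whereas the region model performs a single region entry regardless of the list length. How expensive each step is depends on the scheme and the architecture, which the model deliberately leaves out.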
Figure 23: Effect of varying traversal length on a read-only lock-free list with one thread.
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
Figure 24: Effect of varying traversal length on a lock-free list with one thread and a workload of 50%.
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
Figure 25: Effect of varying traversal length on a read-only lock-free list with 32 threads.
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
Figure 26: Effect of varying traversal length on a lock-free list with 32 threads and a workload of 50%.
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
Figure 27: Effect of varying traversal length on a read-only lock-free list with 32 threads (without LFRC).
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: HPBR, EBR, NEBR, QSBR, stamp.)
Figure 28: Effect of varying traversal length on a lock-free list with 32 threads and a workload of 50% (without LFRC).
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: elements; y-axis: mean µs/op. Series: HPBR, EBR, NEBR, QSBR, stamp.)
When looking at the results of the 32 thread configurations (Figures 25 and 26), LFRC’s runtime
goes through the roof, especially in the read-only case. Therefore Figures 27 and 28 show the same
results, but with LFRC excluded. From these results one can see that in the read-only configuration
the additional overhead of HP is highly significant, but becomes negligible when looking at the
results with 50% workload.
Figure 29: Performance of the List benchmark with 10 elements, a workload of 80% and a varying number of threads (without LFRC).
(Panels: AMD, Sparc, Intel, XeonPhi. x-axis: threads; y-axis: mean µs/op. Series: HPBR, EBR, NEBR, QSBR, stamp.)
For all the other schemes the results suggest that the traversal length does not have a significant
impact on their respective performance. This is not unexpected as NER, QSR and Stamp-it all
benefit from the use of a region guard to amortize overhead over a number of operations. ER does
not use the concept of region guards, so the number of attempts to update the global epoch is in
direct proportion to the number of created guard_ptr instances, and is thus directly linked to the
number of elements in the list. An indication of this can be seen in the Intel results in Figure 28,
but overall this overhead is less relevant than one might have expected.
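The amortizing effect of a region guard can be sketched with a nesting counter: only the outermost guard pays for entering and leaving the critical region. The code below is a minimal single-threaded illustration under that assumption; `ThreadState`, `Guard`, `run_ops` and `count_enters` are hypothetical names, and the counters merely mark the places where a real scheme would do its enter and leave work.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread bookkeeping.
struct ThreadState {
    std::size_t nesting = 0;      // number of guards currently alive
    std::size_t enter_calls = 0;  // how often the expensive enter path ran
    std::size_t leave_calls = 0;  // how often the expensive leave path ran
};

// RAII guard: only the outermost guard enters and leaves the region.
class Guard {
public:
    explicit Guard(ThreadState& ts) : ts_(ts) {
        if (ts_.nesting++ == 0) ++ts_.enter_calls;
    }
    ~Guard() {
        if (--ts_.nesting == 0) ++ts_.leave_calls;
    }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;

private:
    ThreadState& ts_;
};

// Run `ops` operations, each creating `guards_per_op` short-lived guards
// (standing in for guard_ptr acquisitions).
inline void run_ops(ThreadState& ts, std::size_t ops, std::size_t guards_per_op) {
    for (std::size_t i = 0; i < ops; ++i)
        for (std::size_t g = 0; g < guards_per_op; ++g) {
            Guard per_pointer(ts);
        }
}

// Count region entries with or without one enclosing region guard.
inline std::size_t count_enters(std::size_t ops, std::size_t guards_per_op,
                                bool use_region) {
    ThreadState ts;
    if (use_region) {
        Guard region(ts);  // spans all operations
        run_ops(ts, ops, guards_per_op);
    } else {
        run_ops(ts, ops, guards_per_op);
    }
    return ts.enter_calls;
}
```

Without the enclosing guard, 100 operations with 10 guards each enter the region 1000 times; with it, the region is entered once. This mirrors the qualitative argument in the text, not the measured magnitudes.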
A.5 Adding threads
Figure 29 shows the results of the List benchmark with a workload of 80% and a varying number
of threads. These results were excluded from the main text since they are qualitatively similar to
the results with a workload of 20% shown in Figure 7.
A.6 Reclamation efficiency
The results are shown in Figures 30, 31, 32, and 33. What can be seen in all scenarios is that
HP’s efficiency is inversely proportional to the number of threads. This is due to the fact that the
threshold for the number of unreclaimed nodes is quadratic in the number of threads. This is the
case even for the Queue benchmark (Figure 30) and List benchmarks (Figures 31 and 32), even
though the number of hazard pointers per thread is constant in these scenarios. In the HashMap
benchmark (Figure 33) a dynamic number of hazard pointers is used, which makes the situation
even worse.
The implementation allows the calculation of this threshold to be customized, so for future work it
might be interesting to analyse how a different threshold would affect reclamation efficiency and
what impact this would have on the performance.
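The quadratic growth can be made concrete with a small calculation. The sketch below assumes the classic hazard pointer design in which each thread scans its retire list once it holds a number of nodes proportional to the total number of hazard pointers in the system; the proportionality factor and the function names are assumptions for illustration and do not reproduce the threshold used in the benchmarks.

```cpp
#include <cassert>
#include <cstdint>

// Per-thread scan threshold, assumed to be `factor` times the total number
// of hazard pointers (hp_per_thread * threads).
inline std::uint64_t scan_threshold(std::uint64_t threads,
                                    std::uint64_t hp_per_thread,
                                    std::uint64_t factor) {
    return factor * hp_per_thread * threads;
}

// Upper bound on retired but unreclaimed nodes over all threads when every
// thread sits just below its threshold: quadratic in the number of threads.
inline std::uint64_t worst_case_unreclaimed(std::uint64_t threads,
                                            std::uint64_t hp_per_thread,
                                            std::uint64_t factor) {
    return threads * scan_threshold(threads, hp_per_thread, factor);
}
```

With two hazard pointers per thread and a factor of two, this bound is 9216 nodes for 48 threads and 1048576 nodes for 512 threads; doubling the thread count quadruples it.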
In the Queue and List benchmarks on AMD we can see a small bump in the number of unre-
claimed nodes during the first trial for all reclamation schemes except LFRC and HP. After the
first trial they all recover and perform comparably for the rest of the benchmark. It is not entirely
clear what causes this behavior as we did not investigate further.
Apart from this behavior and the previously described issue of HP with a large number of
threads, the results for the Queue and List benchmarks are not too surprising; all schemes perform
more or less comparably. In the Queue benchmark QSR performs somewhat worse on Intel and
XeonPhi, but this is not unexpected as QSR is less well suited for update heavy scenarios.
Figure 30: Number of unreclaimed nodes over time in the Queue benchmark.
Figure 31: Number of unreclaimed nodes over time in the List benchmark with 10 elements and a workload of 20%.
Figure 32: Number of unreclaimed nodes over time in the List benchmark with 10 elements and a workload of 80%.
Figure 33: Number of unreclaimed nodes over time in the HashMap benchmark.
(Figures 30 to 33 each have four panels: AMD (48 threads), Intel (160 threads), XeonPhi (244 threads), Sparc (512 threads), each split into trials 1 to 5. x-axis: collected sample; y-axis: unreclaimed nodes. Series: LFRC, HPBR, EBR, NEBR, QSBR, stamp.)
In the HashMap benchmark (Figure 33) we can see that QSR basically fails completely to
reliably reclaim nodes on all the architectures. The number of nodes is constantly increasing and
does not even go down at the end of the trials when all threads are stopped. This is also the reason
why QSR showed such bad performance in the previous analysis in Section 4.3.
For HP we can also see a consistent increase in the number of unreclaimed nodes over time;
although this number drops sharply right at the beginning of a new trial, it increases
again very rapidly. The only exception is Sparc, where no such drop occurs and the number of
nodes is increasing all the time. The other schemes all perform relatively well on all architectures;
the exception again being Sparc. On Sparc HP, ER, NER and QSR all perform equally
badly. The number of unreclaimed nodes is constantly increasing and does not even go down at
the end of the trials when all threads are stopped. This effect is probably caused by the fact that
in these schemes every thread is responsible for reclaiming its own retired nodes. In Stamp-it we
know if there is some other thread lagging behind, so we can add nodes to a global list and let
that thread take responsibility for reclaiming them. This allows Stamp-it to more reliably reclaim
nodes, especially at the end of each trial.
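The hand-off idea can be illustrated with a sequential toy model. It is not the lock-free Stamp-it implementation: it assumes that a retired node is tagged with the highest stamp handed out so far, that a node may be reclaimed once every active thread holds a larger stamp, and that the thread holding the lowest stamp cleans up the global list when it leaves. The class `ToyStampReclaimer` and all of its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

// Sequential toy model of handing unreclaimable nodes to a lagging thread.
class ToyStampReclaimer {
public:
    // A thread enters a critical region and receives a fresh, increasing stamp.
    std::uint64_t enter() {
        const std::uint64_t stamp = ++highest_stamp_;
        active_.insert(stamp);
        return stamp;
    }

    // Stamp used to tag a node that is retired right now.
    std::uint64_t retire_stamp() const { return highest_stamp_; }

    // A thread leaves its region, passing the stamps of the nodes it retired.
    void leave(std::uint64_t my_stamp,
               const std::vector<std::uint64_t>& local_retired) {
        const bool was_lowest = !active_.empty() && *active_.begin() == my_stamp;
        active_.erase(my_stamp);
        const std::uint64_t lowest = lowest_active();
        for (std::uint64_t s : local_retired) {
            if (s < lowest) ++reclaimed_;   // no active thread can still see it
            else global_.push_back(s);      // a lagging thread will take over
        }
        if (was_lowest) reclaim_global(lowest);
    }

    std::size_t reclaimed() const { return reclaimed_; }
    std::size_t pending_global() const { return global_.size(); }

private:
    std::uint64_t lowest_active() const {
        return active_.empty() ? std::numeric_limits<std::uint64_t>::max()
                               : *active_.begin();
    }

    // Reclaim every globally queued node that is no longer visible.
    void reclaim_global(std::uint64_t lowest) {
        std::vector<std::uint64_t> keep;
        for (std::uint64_t s : global_) {
            if (s < lowest) ++reclaimed_;
            else keep.push_back(s);
        }
        global_.swap(keep);
    }

    std::uint64_t highest_stamp_ = 0;
    std::set<std::uint64_t> active_;     // stamps of threads inside a region
    std::vector<std::uint64_t> global_;  // stamps of handed-off nodes
    std::size_t reclaimed_ = 0;
};
```

If thread A enters before thread B and B retires a node and leaves first, the node cannot be reclaimed yet and moves to the global list; once A leaves, it reclaims the node on B's behalf. In a scheme where each thread reclaims only its own nodes, that node would stay unreclaimed until B became active again.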
The failure to efficiently reclaim nodes increases memory pressure, which has a direct impact
on the runtime. Figure 10 shows the development of the runtime over the five trials. On Sparc we
can see that the runtime of HP, ER, NER and QSR is increasing with each trial, while that of LFRC and
Stamp-it is decreasing. On the other architectures, runtime is decreasing for all schemes except
QSR. This would be the expected behavior since more results can be reused once the hash-map
has been filled.
