Analyzing the Performance of Lock-Free Data Structures: A Conflict-based
  Model by Atalar, Aras et al.
Aras Atalar, Paul Renaud-Goud and Philippas Tsigas
Chalmers University of Technology
{aaras|goud|tsigas}@chalmers.se
Abstract
This paper considers the modeling and the analysis of the performance of lock-free concurrent
data structures. Lock-free designs employ an optimistic conflict control mechanism, allowing several
processes to access the shared data object at the same time. They guarantee that at least one
concurrent operation finishes in a finite number of its own steps regardless of the state of the
operations. Our analysis considers such lock-free data structures that can be represented as linear
combinations of fixed size retry loops.
Our main contribution is a new way of modeling and analyzing a general class of lock-free
algorithms, achieving predictions of throughput that are close to what we observe in practice.
We emphasize two kinds of conflicts that shape the performance: (i) hardware conflicts, due to
concurrent calls to atomic primitives; (ii) logical conflicts, caused by simultaneous operations on
the shared data structure.
We show how to deal with these hardware and logical conflicts separately, and how to combine
them, so as to calculate the throughput of lock-free algorithms. We also propose a common
framework that enables a fair comparison between lock-free implementations by covering the whole
contention domain, together with a better understanding of the performance impacting factors.
This part of our analysis comes with a method for calculating a good back-off strategy to finely
tune the performance of a lock-free algorithm. Our experimental results, based on a set of widely
used concurrent data structures and on abstract lock-free designs, show that our analysis follows
closely the actual code behavior.
arXiv:1508.03566v1 [cs.DS] 14 Aug 2015
Contents
I Introduction
II Related Work
III Problem Statement
  III-A Running Program and Targeted Platform
  III-B Examples and Issues
    III-B1 Immediate Upper Bounds
    III-B2 Conflicts
    III-B3 Process
IV Execution without hardware conflict
  IV-A Setting
    IV-A1 Initial Restrictions
    IV-A2 Notations and Definitions
  IV-B Cyclic Executions
  IV-C Throughput Bounds
V Expansion and Complete Throughput Estimation
  V-A Expansion
  V-B Throughput Estimate
  V-C Several Retry Loops
    V-C1 Problem Formulation
    V-C2 Wasted Retries
    V-C3 Expansion
VI Experimental Evaluation
  VI-A Setting
  VI-B Synthetic Tests
    VI-B1 Single retry loop
    VI-B2 Several retry loops
  VI-C Treiber’s Stack
  VI-D Shared Counter
  VI-E DeleteMin in Priority List
  VI-F Enqueue-Dequeue on a Queue
  VI-G Discussion
  VI-H Back-Off Tuning
VII Conclusion
References
I. Introduction
Lock-free programming provides highly concurrent access to data and has been increasing its
footprint in industrial settings. Providing a modeling and an analysis framework capable of describing
the practical performance of lock-free algorithms is an essential, missing resource necessary to the
parallel programming and algorithmic research communities in their effort to build on previous
intellectual efforts. The definition of lock-freedom mainly guarantees that at least one concurrent
operation on the data structure finishes in a finite number of its own steps, regardless of the state of
the operations. On the individual operation level, lock-freedom cannot guarantee that an operation
will not starve.
The goal of this paper is to provide a way to model and analyze the practically observed
performance of lock-free data structures. In the literature, the common performance measure of
a lock-free data structure is the throughput, i.e. the number of successful operations per unit of
time. It is obtained while threads are accessing the data structure according to an access pattern
that interleaves local work between calls to consecutive operations on the data structure. Although
this access pattern is significant, there is no consensus in the literature on
which access pattern should be used when comparing two data structures. The amount of local work (which we
will refer to as parallel work for the rest of the paper) can be constant ([MS96], [SL00]), uniformly
distributed ([HSY10], [DLM13]), exponentially distributed ([Val94], [DB08]), null ([KH14], [LJ13]),
etc. More questionably, the average amount is rarely varied, which leads to a partial coverage
of the contention domain.
We propose here a common framework enabling a fair comparison between lock-free data
structures, while exhibiting the main phenomena that drive performance, and particularly the
contention, which leads to different kinds of conflicts. As this is the first step in this direction,
we want to deeply analyze the core of the problem, without impacting factors being diluted within
a probabilistic smoothing. Therefore, we choose a constant local work, hence constant access rate
to the data structures. In addition to the prediction of the data structure performance, our model
provides a good back-off strategy, that achieves the peak performance of a lock-free algorithm.
Two kinds of conflict appear during the execution of a lock-free algorithm, both of them leading
to additional work. Hardware conflicts occur when concurrent operations call atomic primitives on
the same data: these calls collide and lead to stall time, which we call expansion. Logical
conflicts take place if concurrent operations overlap: because of the lock-free nature of the algorithm,
several concurrent operations can run simultaneously, but only one retry can logically succeed. We
show that the additional work produced by the failures is not necessarily harmful to system-wide
performance.
We then show how throughput can be computed by connecting these two key factors in an iterative
way. We start by estimating the expansion probabilistically, and emulate the effect of stall time
introduced by the hardware conflicts as extra work added to each thread. Then we estimate the
number of failed operations, that in turn lead to additional extra work, by computing again the
expansion on a system setting where those two new amounts of work have been incorporated, and
reiterate the process; the convergence is ensured by a fixed-point search.
We consider the class of lock-free algorithms that can be modeled as a linear composition of fixed
size retry loops. This class covers numerous extensively used lock-free designs such as stacks [Tre86]
(Pop, Push), queues [MS96] (Enqueue, Dequeue), counters [DLM13] (Increment, Decrement) and
priority queues [LJ13] (DeleteMin).
To evaluate the accuracy of our model and analysis framework, we performed experiments both
on synthetic tests, that capture a wide range of possible abstract algorithmic designs, and on several
reference implementations of extensively studied lock-free data structures. Our evaluation results
reveal that our model is able to capture the behavior of all the synthetic and real designs for
all different numbers of threads and sizes of parallel work (and consequently levels of contention). We also
evaluate the use of our analysis as a tool for tuning the performance of lock-free code by selecting
the back-off strategy that maximizes throughput, comparing our method
against widely known back-off policies, namely linear and exponential.
The rest of the paper is organized as follows. We discuss related work in Section II, then the
problem is formally described in Section III. We consider the logical conflicts in the absence of
hardware conflicts in Section IV, while in Section V, we firstly show how to compute the expansion,
then combine hardware and logical conflicts to obtain the final throughput estimate. We describe
the experimental results in Section VI.
II. Related Work
Anderson et al. [ARJ97] evaluated the performance of lock-free objects in a single processor real-
time system by emphasizing the impact of retry loop interference. Tasks can be preempted during
the retry loop execution, which can lead to interference, and consequently to an inflation in retry
loop execution due to retries. They obtained upper bounds for the number of interferences under
various scheduling schemes for periodic real-time tasks.
Intel [Int13] conducted an empirical study to illustrate the performance and scalability of locks. They
showed that the critical section size, the time interval between releasing and re-acquiring the lock
(which is similar to our parallel section size) and the number of threads contending for the lock are vital
parameters.
Failed retries do not only lead to useless effort but also degrade the performance of successful
ones by contending the shared resources. Alemany et al. [AF92] have pointed out this fact, that is in
accordance with our two key factors, and, without trying to model it, have mitigated those effects
by designing non-blocking algorithms with operating system support.
Alistarh et al. [ACS14] have studied the same class of lock-free structures that we consider in this
paper. The analysis is done in terms of scheduler steps, in a system where only one thread can be
scheduled (and can then run) at each step. If compared with execution time, this is particularly
appropriate to a system with a single processor and several threads, or to a system where the
instructions of the threads cannot be done in parallel (e.g. multi-threaded program on a multi-core
processor with only read and write on the same cache line of the shared memory). In our paper,
the execution is evaluated in terms of processor cycles, strongly related to the execution time. In
addition, the “parallel work” and the “critical work” can be done in parallel, and we only consider
retry-loops with one Read and one CAS, which are serialized. Finally, they bound the asymptotic
expected system latency (with a big O, when the number of threads tends to infinity), while in our
paper we estimate the throughput (close to the inverse of system latency) for any number of threads.
III. Problem Statement
A. Running Program and Targeted Platform
In this paper, we aim at evaluating the throughput of a multi-threaded algorithm that is based
on the utilization of a shared lock-free data structure. Such a program can be abstracted by the
Procedure AbstractAlgorithm (see Figure 1) that represents the skeleton of the function which is
called by each spawned thread. It is decomposed in two main phases: the parallel section, represented
on line 3, and the retry loop, from line 4 to line 7. A retry starts at line 5 and ends at line 7.
As for line 1, the function Initialization shall be seen as an abstraction of the delay between the
spawns of the threads, that is expected not to be null, even when a barrier is used. We then consider
that the threads begin at the exact same time, but have different initialization times.
The parallel section is the part of the code where the thread does not access the shared data
structure; the work that is performed inside this parallel section can possibly depend on the value
that has been read from the data structure, e.g. in the case of processing an element that has been
dequeued from a FIFO (First-In-First-Out) queue.
Procedure AbstractAlgorithm
1 Initialization();
2 while ! done do
3     Parallel_Work();
4     while ! success do
5         current ← Read(AP);
6         new ← Critical_Work(current);
7         success ← CAS(AP, current, new);
Figure 1: Thread procedure
Figure 2: Execution with one wasted retry, and one inevitable failure (timelines of threads T0 to T3 over one cycle)
In each retry, a thread tries to modify the data structure, and does not exit the retry loop until
it has successfully modified the data structure. It does that by firstly reading the access point AP of
the data structure, then according to the value that has been read, and possibly to other previous
computations that occurred in the past, the thread prepares the new desired value as an access
point of the data structure. Finally, it atomically tries to perform the change through a call to the
Compare-And-Swap (CAS) primitive. If it succeeds, i.e. if the access point has not been changed
by another thread between the first Read and the CAS, then it goes to the next parallel section,
otherwise it repeats the process. The retry loop is composed of at least one retry, and we number
the retries starting from 0, since the first iteration of the retry loop is actually not a retry, but a
try.
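As an illustration of the retry loop described above, the following Python sketch mirrors the structure of Procedure AbstractAlgorithm on a shared counter. Python exposes no user-level CAS instruction, so the class `AtomicCell` (a hypothetical helper, not part of the paper) emulates Read and CAS with a lock; only the control flow of the retry loop is meant to be representative, not its performance.

```python
import threading


class AtomicCell:
    """Emulated access point AP: read() and cas() are made atomic with a lock."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        # Compare-And-Swap: install `new` only if the cell still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def thread_procedure(ap, operations, failed_retries):
    """Skeleton of AbstractAlgorithm; the critical work is an increment (assumption)."""
    failures = 0
    for _ in range(operations):
        # Parallel_Work() would go here (no access to the shared structure).
        success = False
        while not success:
            current = ap.read()             # Read(AP)
            new = current + 1               # Critical_Work(current)
            success = ap.cas(current, new)  # CAS(AP, current, new)
            if not success:
                failures += 1               # another thread changed AP in between
    failed_retries.append(failures)


def run(num_threads, operations):
    """Spawn the threads and return (final value of AP, failed retries per thread)."""
    ap = AtomicCell(0)
    failed_retries = []
    threads = [
        threading.Thread(target=thread_procedure, args=(ap, operations, failed_retries))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ap.read(), failed_retries
```

Every operation eventually succeeds exactly once, so the final value equals the number of threads times the number of operations, whatever the number of failed retries.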
We analyze the behavior of AbstractAlgorithm from a throughput perspective, which is defined
as the number of successful data structure operations per unit of time. In the context of Procedure
AbstractAlgorithm, it is equivalent to the number of successful CASs per unit of time.
Figure 3: Execution with minimum number of failures (timelines of threads T0 to T3 over one cycle)
The throughput of the lock-free algorithm, which we denote by T, is impacted by several parameters.
• Algorithm parameters: the amount of work inside a call to Parallel_Work (resp. Critical_Work),
denoted by pw (resp. cw).
• Platform parameters: Read and CAS latencies (rc and cc respectively), and the number P of
processing units (cores). We assume homogeneity for the latencies, i.e. every thread experiences
the same latency when accessing uncontended shared data, which is achieved in practice by
pinning threads to the same socket.
B. Examples and Issues
We first present two straightforward upper bounds on the throughput, and describe the two kinds
of conflict that keep the actual throughput away from those upper bounds.
1) Immediate Upper Bounds: Trivially, the minimum amount of work rlw(-) in a given retry is
rlw(-) = rc + cw + cc, as we should pay at least the memory accesses and the critical work cw in
between.
Thread-wise: A given thread can at most perform one successful retry every pw + rlw(-) units
of time. In the best case, P threads can then lead to a throughput of P/(pw + rlw(-)).
System-wise: By definition, two successful retries cannot overlap, hence we have at most 1
successful retry every rlw(-) units of time.
Altogether, the throughput T is bounded by
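\[
T \le \min\left( \frac{1}{\mathit{rc} + \mathit{cw} + \mathit{cc}},\ \frac{P}{\mathit{pw} + \mathit{rc} + \mathit{cw} + \mathit{cc}} \right),
\]
i.e.
\[
T \le
\begin{cases}
\dfrac{1}{\mathit{rc} + \mathit{cw} + \mathit{cc}} & \text{if } \mathit{pw} \le (P - 1)(\mathit{rc} + \mathit{cw} + \mathit{cc}) \\[1ex]
\dfrac{P}{\mathit{pw} + \mathit{rc} + \mathit{cw} + \mathit{cc}} & \text{otherwise.}
\end{cases}
\tag{1}
\]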
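The bound of Equation 1 can be evaluated directly. The sketch below is a minimal illustration; the function names are hypothetical, and the latency values used in the usage note are made-up numbers, not measurements from the paper.

```python
from fractions import Fraction


def retry_size(rc, cw, cc):
    """Minimum amount of work in one retry: Read + critical work + CAS."""
    return rc + cw + cc


def throughput_upper_bound(P, pw, rc, cw, cc):
    """Upper bound of Equation 1, in successful operations per unit of time."""
    rlw = retry_size(rc, cw, cc)
    system_wise = Fraction(1) / rlw         # successful retries cannot overlap
    thread_wise = Fraction(P) / (pw + rlw)  # one success per pw + rlw per thread
    return min(system_wise, thread_wise)


def retry_loop_is_bottleneck(P, pw, rc, cw, cc):
    """True when pw <= (P - 1) * rlw, i.e. when the system-wise bound applies."""
    return pw <= (P - 1) * retry_size(rc, cw, cc)
```

For instance, with P = 4 and the assumed latencies rc = 10, cw = 20, cc = 10 (so that a retry costs 40 units), the bound is 1/40 as long as pw ≤ 120, and drops to 4/(pw + 40) beyond that point.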
2) Conflicts:
Logical conflicts: Equation 1 expresses the fact that when pw is small enough, i.e. when pw ≤
(P −1)rlw(-), we cannot expect that every thread performs a successful retry every pw+ rlw(-) units
of time, since it is more than what the retry loop can afford. As a result, some logical conflicts,
hence unsuccessful retries, will be inevitable, while the others, if any, are called wasted.
However, different executions can lead to different numbers of failures, which end up with different
throughput values. Figures 2 and 3 depict two executions, where the black parts are the calls to
Initialization, the blue parts are the parallel sections, and the retries are either unsuccessful (in
red) or successful (in green). We experiment with different initialization times, and observe different
synchronizations, hence different numbers of wasted retries. After the initial transient state, the
execution depicted in Figure 3 comprises only the inevitable unsuccessful retries, while the execution
of Figure 2 contains one wasted retry.
We can see on those two examples that a cyclic execution is reached after the transient behavior;
actually, we show in Section IV that, in the absence of hardware conflicts, every execution will become
periodic, if the initialization times are spaced enough. In addition, we prove that the shortest period
is such that, during this period, every thread succeeds exactly once. This finally leads us to define
the additional failures as wasted, since we can directly link the throughput with this number of
wasted retries: a higher number of wasted retries implying a lower throughput.
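The effect of the initialization times on logical conflicts can be explored with a small discrete-event replay. The sketch below is only an illustration under simplifying assumptions that are ours: retries are of unit size, there is no hardware conflict, the Read takes effect when a retry starts and the CAS when it ends, `init_times[i]` is the instant at which thread i starts its first retry, and ties are broken by thread number. The function name `simulate` is hypothetical.

```python
import heapq
from fractions import Fraction


def simulate(init_times, pw, n_retries):
    """Replay `n_retries` retries; return (success end times per thread, failures)."""
    # Each heap entry is (start time of the next retry, thread id).
    pending = [(Fraction(t), tid) for tid, t in enumerate(init_times)]
    heapq.heapify(pending)
    success_ends = [[] for _ in init_times]
    failures = 0
    last_success = None  # completion time of the latest successful CAS
    for _ in range(n_retries):
        start, tid = heapq.heappop(pending)
        end = start + 1  # a retry is of unit size
        if last_success is not None and last_success > start:
            # A CAS succeeded after this thread's Read: the retry fails
            # and the thread immediately starts a new retry.
            failures += 1
            heapq.heappush(pending, (end, tid))
        else:
            # No interfering success: the CAS succeeds at time `end`,
            # then the thread runs its parallel section of size pw.
            last_success = end
            success_ends[tid].append(end)
            heapq.heappush(pending, (end + pw, tid))
    return success_ends, failures
```

With two threads starting at times 0 and 3/10 and pw = 1/2, the replay settles into a pattern where each thread fails once and then succeeds, so consecutive successes of a given thread are 5/2 time units apart.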
Figure 4: Expansion (legend: Read & cw, previously expanded CAS, expansion, CAS)
Hardware conflicts: The requirement of atomicity compels the executing core to own the data
exclusively. This prohibits concurrent execution of atomic instructions
that operate on the same data. Therefore, overlapping parts of atomic instructions
are serialized by the hardware, leading to stalls in subsequently issued ones. For our target lock-free
algorithm, these stalls, which we refer to as expansion, become an important slowdown factor when
threads interfere in the retry loop. As illustrated in Figure 4, the latency of a CAS can expand and
cause a significant decrease in throughput, since the CAS of a successful thread is then expanded by
others; for this reason, the amount of work inside a retry is not constant, but is, generally speaking,
a function of the number of threads that are inside the retry loop.
3) Process: We deal with the two kinds of conflicts separately and connect them through
a fixed-point iteration.
In Section V-A, we compute the expansion in execution time of a retry, noted e, by following
a probabilistic approach. The estimation takes as input the expected number of threads inside the
retry loop at any time, and returns the expected increase in the execution time of a retry due to the
serialization of atomic primitives.
In Section IV, we are given a program without hardware conflicts, described by the size of the parallel
section pw(+) and the size of a retry rlw(+). We compute upper and lower bounds on the throughput
T, the number of wasted retries w, and the average number of threads inside the retry loop Prl.
Without loss of generality, we can normalize those execution times by the execution time of a retry,
and define the parallel section size as pw(+) = q + r, where q is a non-negative integer and r is such
that 0 ≤ r < 1. This pair (together with the number of threads P ) constitutes the actual input of
the estimation.
Finally, we combine those two outcomes in Section V-B by emulating expansion through work not
prone to hardware conflicts and obtain the full estimation of the throughput.
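The overall process has the shape of a fixed-point search: an estimate is fed back into the model until it stops changing. The sketch below shows only that generic iteration. The function `toy_update` is a hypothetical stand-in chosen so that the iteration visibly converges; it is not the expansion or throughput estimate of Sections IV and V.

```python
def fixed_point(update, x0, tol=1e-9, max_iter=10_000):
    """Iterate x <- update(x) until two consecutive values differ by at most tol."""
    x = x0
    for _ in range(max_iter):
        nxt = update(x)
        if abs(nxt - x) <= tol:
            return nxt
        x = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def toy_update(threads_in_retry_loop):
    # Hypothetical stand-in, NOT the paper's estimate: a contraction
    # whose unique fixed point is 2.0.
    return 1.0 + 0.5 * threads_in_retry_loop
```

In the actual analysis, the quantity being iterated would be the expected number of threads inside the retry loop, and the update would chain the expansion estimate with the estimate of failed retries.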
IV. Execution without hardware conflict
We show in this section that, in the absence of hardware conflicts, the execution becomes periodic,
which eases the calculation of the throughput. We start by defining some useful concepts:
(f, P)-cyclic executions are a special kind of periodic executions such that within the shortest period, each
thread performs exactly f unsuccessful retries and 1 successful retry. The well-formed seed is a set
of events that allows us to detect an (f, P)-cyclic execution early, and the gaps are a measure of the
quality of the synchronization between threads. The idea is to iteratively add threads into the game
and show that the periodicity is maintained. Theorem 1 establishes a fundamental relation between
gaps and well-formed seeds, while Theorem 2 proves the periodicity, relying on the disjoint cases
of Lemmas 1, 3, and 4. Finally, we exhibit upper and lower bounds on throughput and number of
failures, along with the average number of threads inside the retry loop.
A. Setting
1) Initial Restrictions:
Remark 1. Concerning correctness, we assume that the reference point of the Read and the CAS
occurs when the thread enters and exits any retry, respectively.
Remark 2. We do not consider simultaneous events, so all inequalities that refer to time comparison
are strict, and can be viewed as follows: time instants are real numbers, and can be equal, but every
event is associated with a thread; also, in order to obtain a strict order relation, we break ties
according to the thread numbers (for instance with the relation <).
2) Notations and Definitions: We recall that P threads are executing the pseudo-code described in
Procedure AbstractAlgorithm, one retry is of unit-size, and the parallel section is of size pw(+) = q + r,
where q is a non-negative integer and r is such that 0 ≤ r < 1. Consider a thread Tn which succeeds
at time Sn: this thread completes a whole retry in 1 unit of time, then executes the parallel section of
size pw(+), and attempts to perform the operation again every unit of time, until one of the attempts
is successful.
Definition 1. An execution with P threads is called (C,P )-cyclic execution if and only if (i) the
execution is periodic, i.e. at every time, every thread is in the same state as one period before, (ii)
the shortest period contains exactly one successful attempt per thread, (iii) the shortest period is
1 + q + r + C.
Definition 2. Let $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$, where the $T_i$ are threads and the $S_i$ are ordered times, i.e. such that
$S_0 < S_1 < \dots < S_{P-1}$. $\mathcal{S}$ is a seed if and only if for all $i \in \llbracket 0, P-1 \rrbracket$, $T_i$ does not succeed between
$S_0$ and $S_i$, and starts a retry at $S_i$.
We define $f(\mathcal{S})$ as the smallest non-negative integer such that $S_0 + 1 + q + r + f(\mathcal{S}) > S_{P-1} + 1$,
i.e. $f(\mathcal{S}) = \max\left(0, \lceil S_{P-1} - S_0 - q - r \rceil\right)$. When $\mathcal{S}$ is clear from the context, we denote $f(\mathcal{S})$ by $f$.
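As a small worked example of these notations, the sketch below normalizes a parallel section size by the size of a retry to obtain the pair (q, r), and evaluates f(S) from the ordered times of a seed. The function names are hypothetical, and exact rational arithmetic is used only to keep the ceiling free of rounding effects.

```python
import math
from fractions import Fraction


def normalize_parallel_section(pw, retry_size):
    """Express pw in units of one retry as q + r, with q an integer and 0 <= r < 1."""
    normalized = Fraction(pw) / Fraction(retry_size)
    q = math.floor(normalized)
    return q, normalized - q


def seed_failures(seed_times, q, r):
    """f(S) = max(0, ceil(S_{P-1} - S_0 - q - r)) for ordered times S_0 < ... < S_{P-1}."""
    return max(0, math.ceil(seed_times[-1] - seed_times[0] - q - r))
```

For example, a parallel section of 100 cycles with retries of 40 cycles gives q = 2 and r = 1/2; a two-thread seed with times 0 and 13/10, together with q = 0 and r = 1/2, gives f = 1.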
Definition 3. $\mathcal{S}$ is a well-formed seed if and only if for each $i \in \llbracket 0, P-1 \rrbracket$, the execution of thread $T_i$
contains the following sequence: first a success beginning at $S_i$, then the parallel section, then $f$ unsuccessful
retries, and finally a successful retry.
Those definitions are coupled through the two natural following properties:
Property 1. Given a (C, P)-cyclic execution, any seed S including P consecutive successes is a
well-formed seed, with f(S) = C.
Proof: Choosing any set of P consecutive successes, we are ensured, by the definition of an
(f, P)-cyclic execution, that for each thread, after the first success, the next success will be obtained after
f failures. The order will be preserved, and this shows that a seed including our set of successes is
actually a well-formed seed.
Property 2. If there exists a well-formed seed in an execution, then after each thread has succeeded once,
the execution coincides with an (f, P)-cyclic execution.
Proof: By the definition of a well-formed seed, we know that the threads will first succeed in
order, fail f times, and succeed again in the same order. Considering the second set of successes
in a new well-formed seed, we observe that the threads will succeed a third time in the same order,
after failing f times. By induction, the execution coincides with an (f, P)-cyclic execution.
Together with the seed concept, we define the notion of gap that we will use extensively in the
next subsection. The general idea of those gaps is that within an (f, P )-cyclic execution, the period
is higher than P × 1, which is the total execution time of all the successful retries within the period.
The difference between the period (that lasts 1 + q + r + f) and P, reduced by r (so that we obtain
an integer), is referred to as the lagging time in the following. If the threads are numbered according to
their order of success (modulo P), as the time elapsed between the successes of two given consecutive
threads is constant (during the next period, this time will remain the same), this lagging time can be
seen in a circular manner (see Figure 5): the threads are represented on a circle whose length is the
lagging time increased by r, and the length between two consecutive threads is the time between the
end of the successful retry of the first thread and the beginning of the successful retry of the second one.
Figure 5: Gaps (threads $T_0, T_1, T_2, \dots, T_{P-1}$ placed on a circle of length $\sum_{n=0}^{P-1} G_n^{(1)}$, with the gaps $G_1^{(1)}$, $G_2^{(1)}$ and $G_0^{(2)}$ marked)
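More formally, for all $(n, k) \in \llbracket 0, P-1 \rrbracket^2$, we define the gap $G_n^{(k)}$ between $T_n$ and its $k$th predecessor
based on the gap with the first predecessor:
\[
\begin{cases}
\forall n \in \llbracket 1, P-1 \rrbracket, & G_n^{(1)} = S_n - S_{n-1} - 1 \\
& G_0^{(1)} = S_0 + q + r + f - S_{P-1},
\end{cases}
\]
which leads to the definition of higher order gaps:
\[
\forall n \in \llbracket 0, P-1 \rrbracket,\ \forall k > 0, \quad G_n^{(k)} = \sum_{j=n-k+1}^{n} G_{j \bmod P}^{(1)}.
\]
For consistency, for all $n \in \llbracket 0, P-1 \rrbracket$, $G_n^{(0)} = 0$.
Equally, the gaps can be obtained directly from the successes: for all $k \in \llbracket 1, P-1 \rrbracket$,
\[
G_n^{(k)} =
\begin{cases}
S_n - S_{n-k} - k & \text{if } n > k \\
S_n - S_{P+n-k} + 1 + q + r + f - k & \text{otherwise.}
\end{cases}
\tag{2}
\]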
Note that, in an (f, P )-cyclic execution, the lagging time is the sum of all first order gaps, reduced
by r.
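The gap definitions translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the list `S` holds the ordered times S_0 < ... < S_{P-1} of a seed, and higher order gaps are computed from the summation over first order gaps (indices taken modulo P) rather than from the closed form.

```python
from fractions import Fraction


def first_order_gaps(S, q, r, f):
    """Return [G_0^(1), ..., G_{P-1}^(1)] for the ordered times S_0 < ... < S_{P-1}."""
    P = len(S)
    wrap_around = S[0] + q + r + f - S[P - 1]                 # G_0^(1)
    consecutive = [S[n] - S[n - 1] - 1 for n in range(1, P)]  # G_n^(1), n >= 1
    return [wrap_around] + consecutive


def gap(S, q, r, f, n, k):
    """k-th order gap G_n^(k): sum of the k first order gaps ending at thread n."""
    g1 = first_order_gaps(S, q, r, f)
    P = len(S)
    # An empty range (k = 0) yields 0, matching the convention G_n^(0) = 0.
    return sum((g1[j % P] for j in range(n - k + 1, n + 1)), Fraction(0))


def lagging_time(q, f, P):
    """Period (1 + q + r + f) minus P, reduced by r."""
    return 1 + q + f - P
```

With three threads, times 0, 11/10 and 11/5, q = 0, r = 1/2 and f = 2, the first order gaps are 3/10, 1/10 and 1/10; their sum, reduced by r, equals the lagging time, which is 0 here.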
Now we extend the concept of well-formed seed to weakly-formed seed.
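Definition 4. Let $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$ be a seed.
$\mathcal{S}$ is a weakly-formed seed for $P$ threads if and only if: $(T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}$ is a well-formed seed for
$P-1$ threads, and the first thread succeeding after $T_{P-2}$ is $T_{P-1}$.
Property 3. Let $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$ be a weakly-formed seed.
Denoting $f = f\left((T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}\right)$, for each $n \in \llbracket 0, P-1 \rrbracket$, $G_n^{(f)} < 1$.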
Proof: We have $S_{P-2} + 1 < S_{P-1} < R_0^f$, and if we denote by $\tilde{G}_n^{(k)}$ the gaps within
$(T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}$, the previous well-formed seed with $P-1$ threads, we know that for all $n \in \llbracket 1, P-2 \rrbracket$,
$\tilde{G}_n^{(1)} = G_n^{(1)}$, and $G_{P-1}^{(1)} + G_0^{(1)} = \tilde{G}_0^{(1)}$, which leads to $G_n^{(k)} \le \tilde{G}_n^{(k)}$, for all $n \in \llbracket 0, P-1 \rrbracket$ and $k$; hence
the weaker property.
Figure 6: Lemma 1 configuration (timelines of threads T0, T1 and T2)
Lemma 1. Let $\mathcal{S}$ be a weakly-formed seed, and $f = f\left((T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}\right)$. If, for all $n \in \llbracket 0, P-1 \rrbracket$,
$G_n^{(f+1)} < 1$, then there exists later in the execution a well-formed seed $\mathcal{S}'$ for $P$ threads such that
$f(\mathcal{S}') = f + 1$.
Proof: The proof is straightforward; $\mathcal{S}$ is actually a well-formed seed such that $f(\mathcal{S}) = f + 1$.
Since $R_0^f - S_{P-1} < G_0^{(1)} < 1$, the first success of $T_0$ after the success of $T_{P-1}$ is its $(f+1)$th retry.
B. Cyclic Executions
Theorem 1. Given a seed $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$, $\mathcal{S}$ is a well-formed seed if and only if for all
$n \in \llbracket 0, P-1 \rrbracket$, $0 \le G_n^{(f)} < 1$.
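The criterion of Theorem 1 is easy to evaluate numerically. The following sketch is a minimal illustration with a hypothetical function name: it computes f from Definition 2 and the f-th order gaps from the first order gaps, and then tests the inequalities. It assumes that the input already is a seed and only checks the gap condition.

```python
import math
from fractions import Fraction


def satisfies_gap_criterion(S, q, r):
    """Check 0 <= G_n^(f) < 1 for every thread n, with f = f(S) as in Definition 2."""
    P = len(S)
    f = max(0, math.ceil(S[-1] - S[0] - q - r))
    # First order gaps: G_0^(1) wraps around the period, the others are consecutive.
    g1 = [S[0] + q + r + f - S[-1]] + [S[n] - S[n - 1] - 1 for n in range(1, P)]
    for n in range(P):
        # f-th order gap of thread n: sum of f first order gaps, indices modulo P.
        g_f = sum((g1[j % P] for j in range(n - f + 1, n + 1)), Fraction(0))
        if not (0 <= g_f < 1):
            return False
    return True
```

With q = 0 and r = 1/2, the times (0, 11/10, 11/5) satisfy the criterion, whereas (0, 11/10, 8/5) do not: in the latter case the third thread starts its retry while the second one is still in its successful retry.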
Proof:
Let $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$ be a seed.
($\Leftarrow$) We assume that for all $n \in \llbracket 0, P-1 \rrbracket$, $0 < G_n^{(f)} < 1$, and we first show that the first successes
occur in the following order: $T_0$ at $S_0$, $T_1$ at $S_1$, ..., $T_{P-1}$ at $S_{P-1}$, $T_0$ again at $R_0^f$. The first threads
that are successful execute their parallel section after their success, then enter their second retry
loop: from this moment, they can make the first attempt of the threads that have not been successful
yet fail. Therefore, we will look at which retry of which already successful threads could have an
impact on which other threads.
We can notice that for all $n \in \llbracket 0, P-1 \rrbracket$, if the first success of $T_n$ occurs at $S_n$, then its next
attempts will potentially occur at $R_n^k = S_n + 1 + q + r + k$, where $k \ge 0$. More specifically, thanks
to Equation 2, for all $n \le f$, $R_n^k = S_{P+n-f} + G_n^{(f)} + k$. Also, for all $k \le f - n$,
\[
\begin{aligned}
R_n^k - S_{P+n-f+k} &= -\left(S_{P+n-f+k} - S_{P+n-f} - k\right) + G_n^{(f)} \\
&= G_n^{(f)} - G_{P+n-f+k}^{(k)} \\
R_n^k - S_{P+n-f+k} &= G_n^{(f-k)},
\end{aligned}
\tag{3}
\]
and this implies that if $k > 0$,
\[
S_{P+n-f+k} - R_n^{k-1} = 1 - G_n^{(f-k)}. \tag{4}
\]
We know, by hypothesis, that $0 < G_n^{(f-k)} < 1$, equivalently $0 < 1 - G_n^{(f-k)} < 1$. Therefore
Equation 3 states that if a thread $T_{n'}$ starts a successful attempt at $S_{P+n-f+k}$, then this thread will
make the $k$th retry of $T_n$ fail, since $T_n$ enters a retry while $T_{n'}$ is in a successful retry. And Equation 4
shows that, given a thread $T_{n'}$ starting a new retry at $S_{P+n-f+k}$, the only retry of $T_n$ that can make
$T_{n'}$ fail on its attempt is the $(k-1)$th one. There is indeed only one retry of $T_n$ that can enter a
retry before the entrance of $T_{n'}$, and exit the retry after it.
$T_0$ is the first thread to succeed at $S_0$, because no other thread is in the retry loop at this time.
Its next attempt will occur at $R_0^0$, and all thread attempts that start before $S_{P-f}$ (included) cannot
fail because of $T_0$, since it then runs the parallel section. Also, since all gaps are positive, the threads
$T_1$ to $T_{P-f}$ will succeed in this order, respectively starting at times $S_1$ to $S_{P-f}$.
Then, using induction, we can show that $T_{P-f+1}, \dots, T_{P-1}$ succeed in this order, respectively
starting at times $S_{P-f+1}, \dots, S_{P-1}$. For $j \in \llbracket 0, f-1 \rrbracket$, let $(\mathcal{P}_j)$ be the following property: for all
$n \in \llbracket 0, P-f+j \rrbracket$, $T_n$ starts a successful retry at $S_n$. We assume that for a given $j$, $(\mathcal{P}_j)$ is true, and
we show that it implies that $T_{P-f+j+1}$ will succeed at $S_{P-f+j+1}$. The successful attempt of $T_{P-f+j}$
at $S_{P-f+j}$ leads, for all $j' \in \llbracket 0, j \rrbracket$, to the failure of the $j'$th retry of $T_{j-j'}$ (explanation of Equation 3).
But for each $T_{j'}$, this attempt was precisely the one that could have made $T_{P-f+j+1}$ fail on its attempt
at $S_{P-f+j+1}$ (explanation of Equation 4). Given that all threads $T_n$, where $n > P-f+j+1$, do
not start any retry loop before $S_{P-f+j+1}$, $T_{P-f+j+1}$ will succeed at $S_{P-f+j+1}$. By induction, $(\mathcal{P}_j)$ is
true for all $j \in \llbracket 0, f-1 \rrbracket$.
Finally, when $T_{P-1}$ succeeds, it makes the $(f-1-n)$th retry of $T_n$ fail, for all $n \in \llbracket 0, f-1 \rrbracket$; also
the next potentially successful attempt for $T_n$ is at $R_n^{f-n}$. (Naturally, for all $n \in \llbracket f, P-1 \rrbracket$, the next
potentially successful attempt for $T_n$ is at $R_n^0$.)
We can observe that for all $n < P$, $j \in \llbracket 0, P-1-n \rrbracket$, and all $k \ge j$,
\[
\begin{aligned}
R_{n+j}^{k-j} - R_n^k &= S_{n+j} + k - j - (S_n + k) \\
R_{n+j}^{k-j} - R_n^k &= G_{n+j}^{(j)},
\end{aligned}
\tag{5}
\]
hence for all $n \in \llbracket 1, f \rrbracket$,
\[
R_n^{f-n} - R_0^f = G_n^{(n)} > 0.
\]
As we have as well, for all $n \in \llbracket f+1, P-1 \rrbracket$, $R_n^0 > R_f^0$, we obtain that among all the threads,
the earliest possibly successful attempt is $R_0^f$. Following $T_{P-1}$, $T_0$ is consequently the next successful
thread in its $f$th retry.
To conclude this part, we can renumber the threads ($T_{n+1}$ becoming now $T_n$ if $n > 0$, and $T_0$
becoming $T_{P-1}$), and follow the same line of reasoning. The only difference is the fact that $T_{P-1}$
(according to the new numbering) enters the retry loop $f$ units of time before $S_{P-1}$, but it does not
interfere with the other threads, since we know that those attempts will fail.
There remains the case where there exists $n \in \llbracket 0, P-1 \rrbracket$ such that $G_n^{(f)} = 0$. This implies that
$f = 0$, thus we have a well-formed seed.
($\Rightarrow$) We now prove the implication by contraposition; we assume that there exists $n \in \llbracket 0, P-1 \rrbracket$
such that $G_n^{(f)} > 1$ or $G_n^{(f)} < 0$, and show that $\mathcal{S}$ is not a well-formed seed.
We assume first that an $f$th order gap is negative. As it is a sum of 1st order gaps, there
exists $n'$ such that $G_{n'}^{(1)}$ is negative; let $n''$ be the highest one.
If $n'' > 0$, then either the threads $T_0, \dots, T_{n''-1}$ succeeded in order at their 0th retry, and then
$T_{n''-1}$ makes $T_{n''}$ fail at its 0th retry (we have a seed, hence by definition, $S_{n''-1} < S_{n''}$, and $G_{n''}^{(1)} < 0$,
thus $S_{n''-1} < S_{n''} < S_{n''-1} + 1$), or they did not succeed in order at their first try. In both cases, $\mathcal{S}$
is not a well-formed seed.
If $n'' = 0$, let us assume that $\mathcal{S}$ is a well-formed seed. Let also a new seed be $\mathcal{S}' = (T_i, S'_i)_{i \in \llbracket 0, P-1 \rrbracket}$,
where for all $n \in \llbracket 0, P-2 \rrbracket$, $S'_{n+1} = S_n$, and $S'_0 = S_{P-1} - (q + 1 + f + r)$. Like $\mathcal{S}$, $\mathcal{S}'$ is a well-formed
seed; however, $G_1^{(1)}$ is negative, and we fall back into the previous case, which shows that $\mathcal{S}'$ is not
a well-formed seed. This is absurd, hence $\mathcal{S}$ is not a well-formed seed.
We assume now that every gap is positive and choose $n_0$ defined by $n_0 = \min\{n \;;\; \exists k \in \llbracket 0, P-1 \rrbracket,\ G^{(k)}_{n+k} > 1\}$, and $f_0 = \min\{k \;;\; G^{(k)}_{n_0+k} > 1\}$: among the gaps that exceed 1, we pick those that concern the earliest thread, and among them the one with the lowest order.
Let us assume that threads T0, . . . , TP−1 succeed at their 0th retry in this order, then T0, . . . , Tn0
complete their second successful retry loop at their f th retry, in this order. If this is not the case, then
S is not a well-formed seed, and the proof is completed. According to Equation 5, we have, on the one hand, $R^{f_0-1}_{n_0+1} - R^{f_0}_{n_0} = G^{(1)}_{n_0+1}$, which implies $R^{f_0}_{n_0+1} - 1 - R^{f_0}_{n_0} = G^{(1)}_{n_0+1}$, thus $R^{f}_{n_0+1} - (R^{f}_{n_0} + 1) = G^{(1)}_{n_0+1}$; and on the other hand, $R^{0}_{n_0+f_0} - R^{f_0}_{n_0} = G^{(f_0)}_{n_0+f_0}$, implying $R^{f-f_0}_{n_0+f_0} - (R^{f}_{n_0} + 1) = G^{(f_0)}_{n_0+f_0} - 1$. As we know that $G^{(f_0)}_{n_0+f_0} - G^{(1)}_{n_0+1} = G^{(f_0-1)}_{n_0+f_0} < 1$ by definition of $f_0$ (and $n_0$), we can derive that $R^{f}_{n_0+1} - (R^{f}_{n_0} + 1) > R^{f-f_0}_{n_0+f_0} - (R^{f}_{n_0} + 1)$. We have assumed that $T_{n_0}$ succeeds at its $f$th retry, which will end at $R^{f}_{n_0} + 1$. The previous inequality states then that $T_{n_0+1}$ cannot be successful at its $f$th retry, since either a thread succeeds before $T_{n_0+f_0}$ and makes both $T_{n_0+f_0}$ and $T_{n_0+1}$ fail, or $T_{n_0+f_0}$ succeeds and makes $T_{n_0+1}$ fail. We have shown that S is not a well-formed seed.
Lemma 2. Assuming r ≠ 0, if a new thread is added to an (f, P)-cyclic execution, it will eventually
succeed.
Proof:
Let $R^0_P$ be the time of the 0th retry of the new thread, that we number $T_P$. If this retry is successful, we are done; let us assume now that this retry is a failure, and let us shift the thread numbers (for the threads $T_0, \dots, T_{P-1}$) so that $T_0$ makes $T_P$ fail on its first attempt. We distinguish two cases, depending on whether $G^{(P)}_0 > R^0_P - S_0$ or not.
We assume that $G^{(P)}_0 > R^0_P - S_0$. We know that $n \mapsto G^{(n)}_n$ is increasing on $\llbracket 0, P-1 \rrbracket$ and that $G^{(0)}_0 = 0$, hence let $n_0 = \max\{n \in \llbracket 0, P-1 \rrbracket \;;\; G^{(n)}_n < R^0_P - S_0\}$. For all $k \in \llbracket 0, n_0 \rrbracket$, we have
\[
R^k_P - S_k = k + R^0_P - \left(G^{(k)}_k + S_0 + k\right) = R^0_P - S_0 - G^{(k)}_k,
\]
hence $R^k_P - S_k > 0$ and $R^k_P - S_k < R^0_P - S_0 < 1$.
This shows that $T_0, \dots, T_{n_0}$, because of their successes at $S_0, \dots, S_{n_0}$, successively make the 0th, ..., $n_0$th retries (respectively) of $T_P$ fail. The next attempt for $T_P$ is at $R^{n_0+1}_P$, which fulfills the following inequality: $R^{n_0+1}_P - (S_{n_0} + 1) < S_{n_0+1} - (S_{n_0} + 1)$, since
\[
R^{n_0+1}_P - S_{n_0+1} = \left(n_0 + 1 + R^0_P\right) - \left(G^{(n_0+1)}_{n_0+1} + S_0 + n_0 + 1\right) < 0.
\]
$T_{n_0+1}$ should have been the successful thread, but $T_P$ starts a retry before $S_{n_0+1}$, and therefore succeeds.
We consider now the reverse case by assuming that $G^{(P)}_0 < R^0_P - S_0$. With the previous line of reasoning, we can show that $T_0, \dots, T_{P-1}$, because of their successes at $S_0, \dots, S_{P-1}$, successively make the 0th, ..., $(P-1)$th retries (respectively) of $T_P$ fail. Then we are back in the same situation as when $T_0$ made $T_P$ fail for the first time, except that the success of $T_0$ starts at $S'_0 = S_0 + G^{(P)}_0$. As $G^{(P)}_0 = q + r + f - P > 0$ and $q$, $f$ and $P$ are integers, we have that $G^{(P)}_0 \geq r$. Moreover, if we had $G^{(P)}_0 > r$, we would have $G^{(P)}_0 \geq 1 + r > R^0_P - S_0$, which is absurd. Indeed, $S_0$ makes $R^0_P$ fail, therefore $G^{(P)}_0$ should be less than 1. Consequently, we are ensured that $G^{(P)}_0 = r$.
We define
\[
k_0 = \left\lfloor \frac{R^0_P - S_0}{r} \right\rfloor;
\]
also, for every $k \in \llbracket 1, k_0 \rrbracket$, $r < R^0_P - (S_0 + k \times r)$ and $r > R^0_P - (S_0 + (k_0 + 1) \times r)$: the cycle of successes of $T_0, \dots, T_{P-1}$ is executed $k_0$ times. Then the situation is similar to the first case, and $T_P$ will succeed.
Lemma 3. Let $\mathcal{S}$ be a weakly-formed seed, and $f = f\left((T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}\right)$. If $G^{(f+1)}_f > 1$, and if the second success of $T_{P-1}$ does not occur before the second success of $T_{f-1}$, then we can find in the execution a well-formed seed $\mathcal{S}'$ for $P$ threads such that $f(\mathcal{S}') = f$.
Proof:
Figure 7: Lemma 3 configuration
Let us first remark that, by the definition of a weakly-formed seed, all threads will succeed once,
in order. Then two ordered groups of threads will compete for each of the next successes, until Tf−1
succeeds for the second time.
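Let $e$ be the smallest integer of $\llbracket f, P-1 \rrbracket$ such that the second success of $T_e$ occurs after the second success of $T_{f-1}$. Let then $\mathcal{S}_1$ and $\mathcal{S}_2$ be the two groups of threads that are in competition, defined by
\[
\mathcal{S}_1 = \{T_n \;;\; n \in \llbracket 0, f-1 \rrbracket\}, \qquad
\mathcal{S}_2 = \{T_n \;;\; n \in \llbracket f, e-1 \rrbracket\}.
\]
For all $n \in \llbracket 0, e-1 \rrbracket$, we note
\[
\operatorname{rank}(n) =
\begin{cases}
G^{(n+1)}_n & \text{if } T_n \in \mathcal{S}_1 \\
G^{(n+1)}_n - 1 & \text{if } T_n \in \mathcal{S}_2.
\end{cases}
\]
We define $\sigma$, a permutation of $\llbracket 0, e-1 \rrbracket$ that describes the reordering of the threads during the round of the second successes, such that, for all $(i, j) \in \llbracket 0, e-1 \rrbracket^2$, $\sigma(i) < \sigma(j)$ if and only if $\operatorname{rank}(i) < \operatorname{rank}(j)$.
We also define a function that will help in expressing the $\sigma^{-1}(k)$'s:
\[
\begin{array}{rcl}
m_2 : \llbracket 0, e-1 \rrbracket & \longrightarrow & \llbracket f, e-1 \rrbracket \\
k & \longmapsto & \max\{\ell \in \llbracket f, e-1 \rrbracket \;;\; T_\ell \in \mathcal{S}_2 \;;\; \sigma(\ell) \leq k\}.
\end{array}
\]
We note that $\operatorname{rank}|_{\llbracket 0, f-1 \rrbracket}$ is increasing, as well as $\operatorname{rank}|_{\llbracket f, e-1 \rrbracket}$. This shows that $\#\{T_\ell \in \mathcal{S}_2 \;;\; \sigma(\ell) \leq k\} = m_2(k) - (f-1)$. Consequently, if $T_{\sigma^{-1}(k)} \in \mathcal{S}_2$, then
\[
\begin{aligned}
m_2(k) &= \#\{T_\ell \in \mathcal{S}_2 \;;\; \sigma(\ell) \leq k\} + f - 1 \\
&= \#\{T_\ell \in \mathcal{S}_2 \;;\; \ell \leq \sigma^{-1}(k)\} + f - 1 \\
&= \sigma^{-1}(k) - f + 1 + f - 1 \\
m_2(k) &= \sigma^{-1}(k).
\end{aligned}
\]
Conversely, if $T_{\sigma^{-1}(k)} \in \mathcal{S}_1$, among $\{T_{\sigma(n)} \;;\; n \in \llbracket 0, k \rrbracket\}$, there are exactly $m_2(k) - f + 1$ threads in $\mathcal{S}_2$, hence
\[
\sigma^{-1}(k) = k + 1 - (m_2(k) - f + 1) - 1 = f + k - m_2(k) - 1.
\]
In both cases, among $\{T_{\sigma(n)} \;;\; n \in \llbracket 0, k \rrbracket\}$, there are exactly $m_2(k) - f + 1$ threads in $\mathcal{S}_2$, and $m_1(k) = k - (m_2(k) - f)$ threads in $\mathcal{S}_1$.
We prove by induction that after this first round, the next successes will be, respectively, achieved by $T_{\sigma^{-1}(0)}, T_{\sigma^{-1}(1)}, \dots, T_{\sigma^{-1}(e-1)}$. In the following, by "$k$th success", we mean $k$th success after the first success of $T_{P-1}$, starting from 0, and the $R^j_i$'s denote the attempts of the second round.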
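Let $(P_K)$ be the following property: for all $k \leq K$, the $k$th success is achieved by $T_{\sigma^{-1}(k)}$ at $R^{f+k-\sigma^{-1}(k)}_{\sigma^{-1}(k)}$. We assume $(P_K)$ true, and we show that the $(K+1)$th success is achieved by $T_{\sigma^{-1}(K+1)}$ at $R^{f+K+1-\sigma^{-1}(K+1)}_{\sigma^{-1}(K+1)}$.
We first show that if $T_{\sigma^{-1}(K)} \in \mathcal{S}_1$, then
\[
R^{m_1(K)-1}_{m_2(K)+1} > R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} > R^{m_1(K)}_{m_2(K)}. \tag{6}
\]
On the one hand,
\[
\begin{aligned}
R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} &= K - \sigma^{-1}(K) + R^{f}_{\sigma^{-1}(K)} \\
&= K - \sigma^{-1}(K) + R^{f}_{0} + \sigma^{-1}(K) + G^{(\sigma^{-1}(K))}_{\sigma^{-1}(K)} \\
&= K + S_{P-1} + 1 + G^{(1)}_0 + G^{(\sigma^{-1}(K))}_{\sigma^{-1}(K)} \\
R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} &= K + S_{P-1} + 1 + G^{(\sigma^{-1}(K)+1)}_{\sigma^{-1}(K)}.
\end{aligned}
\]
On the other hand,
\[
\begin{aligned}
R^{f+K-m_2(K)}_{m_2(K)} &= (m_2(K) - f) + R^{K-(m_2(K)-f)}_{f} + G^{(m_2(K)-f)}_{m_2(K)} \\
&= (m_2(K) - f) + K - (m_2(K) - f) + R^{0}_{f} + G^{(m_2(K)-f)}_{m_2(K)} \\
&= (m_2(K) - f) + K - (m_2(K) - f) + S_{P-1} + 1 + \left(G^{(f+1)}_f - 1\right) + G^{(m_2(K)-f)}_{m_2(K)} \\
R^{f+K-m_2(K)}_{m_2(K)} &= K + S_{P-1} + 1 + G^{(m_2(K)+1)}_{m_2(K)} - 1.
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} - R^{m_1(K)}_{m_2(K)} &= R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} - R^{f+K-m_2(K)}_{m_2(K)} \\
&= G^{(\sigma^{-1}(K)+1)}_{\sigma^{-1}(K)} - \left(G^{(m_2(K)+1)}_{m_2(K)} - 1\right) \\
R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} - R^{m_1(K)}_{m_2(K)} &= \operatorname{rank}\left(\sigma^{-1}(K)\right) - \operatorname{rank}\left(m_2(K)\right).
\end{aligned}
\]
In a similar way, we can obtain that if $T_{\sigma^{-1}(K)} \in \mathcal{S}_2$, then
\[
R^{m_2(K)}_{m_1(K)} > R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} > R^{m_2(K)+1}_{m_1(K)-1}. \tag{7}
\]
In addition, we recall that if $T_{\sigma^{-1}(K)} \in \mathcal{S}_2$, $\sigma^{-1}(K) = m_2(K)$, thus the second inequality of Equation 6 becomes an equality, and if $T_{\sigma^{-1}(K)} \in \mathcal{S}_1$, $\sigma^{-1}(K) = f + K - m_2(K) - 1$, hence the second inequality of Equation 7 becomes an equality.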
Now let us look at which attempts of the other threads $T_{\sigma^{-1}(K)}$ made fail. From now on, and until explicitly said otherwise, we assume that $T_{\sigma^{-1}(K)} \in \mathcal{S}_1$. According to Equation 6, we have
\[
\begin{gathered}
R^{m_1(K)-1}_{m_2(K)+1} > R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} > R^{m_1(K)}_{m_2(K)} \\
R^{m_1(K)-j}_{m_2(K)+j} - R^{m_1(K)-1}_{m_2(K)+1} < R^{m_1(K)-j}_{m_2(K)+j} - R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} < R^{m_1(K)-j}_{m_2(K)+j} - R^{m_1(K)}_{m_2(K)} \\
G^{(j-1)}_{m_2(K)+j} < R^{m_1(K)-j}_{m_2(K)+j} - R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} < G^{(j)}_{m_2(K)+j}
\end{gathered}
\]
This holds for every $j \in \llbracket 1, m_1(K) \rrbracket$, implying $j \leq f$, since there could not be more than $f$ threads in $\mathcal{S}_1$. Therefore, as by assumption gaps of at most $f$th order are between 0 and 1,
\[
0 < R^{m_1(K)-j}_{m_2(K)+j} - R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} < 1;
\]
showing that the success of $T_{\sigma^{-1}(K)}$ makes thread $T_{m_2(K)+j}$ fail on its attempt at $R^{m_1(K)-j}_{m_2(K)+j}$, for all $j \in \llbracket 1, m_1(K) \rrbracket$.
Since $T_{\sigma^{-1}(K)} \in \mathcal{S}_1$, $\sigma^{-1}(K) = m_1(K) - 1$. Also, for all $j \in \llbracket 0, f-1-m_1(K) \rrbracket$,
\[
\begin{aligned}
R^{m_2(K)-j}_{m_1(K)+j} - R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} &= R^{m_2(K)-j}_{m_1(K)+j} - R^{m_2(K)+1}_{m_1(K)-1} \\
&= \left(R^{m_2(K)-j}_{m_1(K)-1} + (j+1) + G^{(j+1)}_{m_1(K)+j}\right) - \left(R^{m_2(K)-j}_{m_1(K)-1} + (j+1)\right) \\
R^{m_2(K)-j}_{m_1(K)+j} - R^{f+K-\sigma^{-1}(K)}_{\sigma^{-1}(K)} &= G^{(j+1)}_{m_1(K)+j}
\end{aligned}
\]
As a result, $T_{\sigma^{-1}(K)}$ makes $T_{m_1(K)+j}$ fail on its attempt at $R^{m_2(K)-j}_{m_1(K)+j}$, for all $j \in \llbracket 0, f-1-m_1(K) \rrbracket$, and the next attempt will occur at $R^{m_2(K)-j+1}_{m_1(K)+j}$.
Altogether, the next attempt after the end of the success of $T_{\sigma^{-1}(K)}$ for $T_{m_1(K)+j}$ is $R^{m_2(K)-j+1}_{m_1(K)+j}$, for $j \in \llbracket 0, f-1-m_1(K) \rrbracket$, and for $T_{m_2(K)+j}$ is $R^{m_1(K)-j+1}_{m_2(K)+j}$, for all $j \in \llbracket 1, m_1(K) \rrbracket$.
Additionally, a thread will begin a new retry loop, the 0th retry being at $R^0_{m_2(K)+m_1(K)+1} = R^0_{f+K+1}$. We note that $f + K + 1$ could be higher than $P - 1$, referring to a thread whose number is more than $P - 1$. Actually, if $n > P - 1$, $R^j_n$ refers to the $j$th retry of $T_{\operatorname{rank}(n-P+1)}$, after its first two successes.
The two heads, i.e. the two smallest indices, of $\mathcal{S}_1 \cap \sigma^{-1}(\llbracket K+1, e-1 \rrbracket)$ and $\mathcal{S}_2 \cap \sigma^{-1}(\llbracket K+1, e-1 \rrbracket)$ will then compete for being successful. Indeed, within $\mathcal{S}_1$, for $j \in \llbracket 0, f-1-m_1(K) \rrbracket$,
\[
R^{m_2(K)-j+1}_{m_1(K)+j} - R^{m_2(K)+1}_{m_1(K)} = G^{(j)}_{m_1(K)+j} > 0,
\]
thus if someone succeeds in $\mathcal{S}_1$, it will be $T_{m_1(K)}$. In the same way, for all $j \in \llbracket 1, m_1(K)+1 \rrbracket$,
\[
R^{m_1(K)-j+1}_{m_2(K)+j} - R^{m_1(K)}_{m_2(K)+1} = G^{(j-1)}_{m_2(K)+j} > 0,
\]
meaning that if someone succeeds in $\mathcal{S}_2$, it will be $T_{m_2(K)+1}$.
Let us compare now those two candidates:
\[
\begin{aligned}
R^{m_2(K)+1}_{m_1(K)} - R^{m_1(K)}_{m_2(K)+1} &= m_2(K) + 1 - f + S_{P-1} + m_1(K) + G^{(m_1(K)+1)}_{m_1(K)} \\
&\quad - \left(m_1(K) + R^0_f + m_2(K) + 1 - f + G^{(m_2(K)+1-f)}_{m_2(K)+1}\right) \\
&= S_{P-1} - 1 + G^{(m_1(K)+1)}_{m_1(K)} - \left(S_{P-1} + G^{(f+1)}_f - 1 + G^{(m_2(K)+1-f)}_{m_2(K)+1}\right) \\
&= G^{(m_1(K)+1)}_{m_1(K)} - \left(G^{(m_2(K)+2)}_{m_2(K)+1} - 1\right) \\
R^{m_2(K)+1}_{m_1(K)} - R^{m_1(K)}_{m_2(K)+1} &= \operatorname{rank}(m_1(K)) - \operatorname{rank}(m_2(K) + 1).
\end{aligned}
\]
By definition, $\sigma^{-1}(K+1)$ is either $m_1(K)$ or $m_2(K) + 1$ and corresponds to the next successful thread. We can follow the same line of reasoning in the case where $T_{\sigma^{-1}(K)} \in \mathcal{S}_2$ and prove in this way that $(P_{K+1})$ is true.
$(P_0)$ is true, and the property spreads until $(P_{e-1})$, where all threads of $\mathcal{S}_1$ and $\mathcal{S}_2$ have been successful, in the order ruled by $\sigma^{-1}$, i.e. $T_{\sigma^{-1}(0)}, \dots, T_{\sigma^{-1}(e-1)}$. And before those successes the threads $T_{e-1} = T_{\sigma^{-1}(e-1)}, \dots, T_{P-1}$ have been successful as well. The seed composed of those successes is a well-formed seed. Given a thread, the gap between this thread and the next one in the new order could indeed not be higher than the gap in the previous order with its next thread. Also the $f$th order gaps remain smaller than 1. And as $T_{e-1}$ succeeds the second time after $f$ failures, it means that the new seed $\mathcal{S}''$ is such that $f(\mathcal{S}'') = f$.
Figure 8: Lemma 4 configuration
Lemma 4. Let $\mathcal{S}$ be a weakly-formed seed, and $f = f\left((T_i, S_i)_{i \in \llbracket 0, P-2 \rrbracket}\right)$. If $G^{(f+1)}_f > 1$ and if the second success of $T_{P-1}$ occurs before the second success of $T_{f-1}$, then we can find in the execution a well-formed seed $\mathcal{S}'$ for $P$ threads such that $f(\mathcal{S}') = f$.
Proof: Until the second success of TP−1, the execution follows the same pattern as in Lemma 3.
Actually, the case invoked in the current lemma could have been handled in the previous lemma,
but it would have implied tricky notations, when we referred to Trank(n−P+1). Let us deal with this
case independently then, and come back to the instant where TP−1 succeeds for the second time.
We had $0 < R^0_{f-1} - S_{P-1} = G^{(f)}_{f-1} < 1$. For the thread $T_{\sigma(j)}$ to succeed at its $k$th retry after the first success of $T_{P-1}$ and before $T_{f-1}$, it should necessarily fulfill the following condition: $j + 1 < R^k_{\sigma(j)} - S_{P-1} < j + 1 + G^{(f)}_{f-1}$. This holds also for the second success of $T_{P-1}$, which implies that $P' < S_{P-1} + 1 + q + r + h - S_{P-1} < P' + G^{(f)}_{f-1}$, where $h$ is the number of failures of $T_{P-1}$ before its second success and $P'$ is the number of successes between the two successes of $T_{P-1}$. As $G^{(f)}_{f-1} < 1$, and $q$, $P'$ and $h$ are non-negative integers, we have $r < G^{(f)}_{f-1}$ and $h = P' - 1 - q$.
To conclude, as any gap at any order is less than the gap between the two successes of $T_{P-1}$, which is $r < 1$, we found a well-formed seed for $P'$ threads.
Finally any other thread will eventually succeed (see Lemma 2). We can renumber the threads such that $T_{P'}$ is the first thread that is not in the well-formed seed to succeed, and the threads of the well-formed seed succeeded previously as $T_0, \dots, T_{P'-1}$. As explained before, for all $(k, n) \in \llbracket 0, P'-1 \rrbracket^2$, $G^{(k)}_n < G^{(n)}_n = r$. With the new thread, the first order gaps are changed by decomposing $G^{(1)}_0$ into $G^{(1)}_{P'}$ and the new $G^{(1)}_0$. All gaps can only be decreased, hence we have a new well-formed seed for $P' + 1$ threads. We repeat the process until all threads have been encountered, and obtain in the end $\mathcal{S}'$, a well-formed seed with $P$ threads such that $f(\mathcal{S}') = P - 1 - q$, which is an optimal cyclic execution.
Still, as $T_f$ succeeds between two successes of $T_{P-1}$ that are separated by $r$, we had, in the initial configuration: $G^{(P-1-f)}_{P-1} < r$. As, in addition, we have both $G^{(f)}_{f-1} < 1$ and $G^{(1)}_f < 1$, we conclude that the lagging time was initially less than $2 + r$. By hypothesis, we know that $G^{(f+1)}_f > 1$, which implies that, before the entry of the new thread, the lagging time was $1 + r$. In the final execution with one more thread, the lagging time is $r$ and we have one more success in the cycle, thus $f(\mathcal{S}') = f$.
Theorem 2. Assuming r ≠ 0, if a new thread is added to an (f, P − 1)-cyclic execution, then all the
threads will eventually form either an (f, P )-cyclic execution, or an (f + 1, P )-cyclic execution.
Proof: According to Lemma 2, the new thread will eventually succeed. In addition, we recall that Properties 1 and 2 ensure that before the first success of the new thread, any set of $P - 1$ consecutive successes is a well-formed seed with $P - 1$ threads. We then consider a seed (we number the threads accordingly, and number the new thread as $T_{P-1}$) such that the success of the new thread occurs between the success of $T_{P-2}$ and $T_0$; we obtain in this way a weakly-formed seed $\mathcal{S} = (T_n, S_n)_{n \in \llbracket 0, P-1 \rrbracket}$. We differentiate between two cases.
Firstly, if for all $n \in \llbracket 0, P-1 \rrbracket$, $G^{(f+1)}_n < 1$, according to Lemma 1, we can find later in the execution a well-formed seed $\mathcal{S}'$ for $P$ threads such that $f(\mathcal{S}') = f + 1$, hence we eventually reach an $(f+1, P)$-cyclic execution.
Let us assume now that this condition is not fulfilled. There exists $n_0 \in \llbracket 0, P-1 \rrbracket$ such that $G^{(f+1)}_{n_0} > 1$. We shift the thread numbers, such that $n_0$ is now $f$, and we have then $G^{(f+1)}_f > 1$. Then two cases are feasible. If the second success of $T_{P-1}$ occurs before the second success of $T_{f-1}$, then Lemma 4 shows that we will reach an $(f, P)$-cyclic execution. Otherwise, from Lemma 3, we conclude that an $(f, P)$-cyclic execution will still occur.
C. Throughput Bounds
Firstly we calculate the expression of throughput and the expected number of threads inside the
retry loop (that is needed when we gather expansion and wasted retries). Then we exhibit upper
and lower bounds on both throughput and the number of failures, and show that those bounds are
reached. Finally, we give the worst case on the number of wasted retries.
Lemma 5. In an $(f, P)$-cyclic execution, the throughput is
\[
T = \frac{P}{q + r + 1 + f}. \tag{8}
\]
Proof: By definition, the execution is periodic, and the period lasts $q + r + 1 + f$ units of time. As $P$ successes occur during this period, we end up with the claimed expression.
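As a small illustration of Equation 8, the following Python sketch evaluates the throughput of a cyclic execution. The function name and the sample values are ours and purely illustrative; time is measured in units of one retry, as in this section.

```python
def cyclic_throughput(P, q, r, f):
    """Successes per unit of time in an (f, P)-cyclic execution (Equation 8).

    P: number of threads, q + r: parallel work expressed in retries
    (q integer part, r fractional part), f: failures before each success.
    """
    period = q + r + 1 + f  # length of one period, in retries
    return P / period


if __name__ == "__main__":
    # Illustrative values only, not taken from the experiments.
    print(cyclic_throughput(P=4, q=2, r=0.5, f=1))
```

Increasing f lengthens the period and therefore lowers the throughput, which is why the bounds on f derived next translate directly into bounds on T.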
Lemma 6. In an $(f, P)$-cyclic execution, the average number of threads $P_{rl}$ in the retry loop is given by
\[
P_{rl} = P \times \frac{f + 1}{q + r + f + 1}.
\]
Proof: Within a period, each thread spends $f + 1$ units of time in the retry loop, among the $q + r + f + 1$ units of time of the period, hence the Lemma.
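The formula of Lemma 6 can be sketched in the same way. The helper below is hypothetical (the name is ours); it only restates the ratio of time spent in the retry loop to the period length.

```python
def threads_in_retry_loop(P, q, r, f):
    """Average number of threads inside the retry loop (Lemma 6).

    Each thread spends f + 1 retries in the retry loop out of a period
    of q + r + f + 1 retries.
    """
    return P * (f + 1) / (q + r + f + 1)


if __name__ == "__main__":
    # Illustrative values only.
    print(threads_in_retry_loop(P=4, q=2, r=0.5, f=1))
```

Note that this value is the throughput of Equation 8 multiplied by f + 1, the time each operation spends in the retry loop.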
Lemma 7. The number of failures is not less than $f^{(-)}$, where
\[
f^{(-)} =
\begin{cases}
P - q - 1 & \text{if } q \leq P - 1 \\
0 & \text{otherwise,}
\end{cases}
\quad \text{and accordingly,} \quad
T \leq
\begin{cases}
\dfrac{P}{P + r} & \text{if } q \leq P - 1 \\[2ex]
\dfrac{P}{q + r + 1} & \text{otherwise.}
\end{cases}
\tag{9}
\]
Proof: According to Equation 8, the throughput is maximized when the number of failures is minimized. In addition, we have two lower bounds on the number of failures: (i) $f \geq 0$, and (ii) $P$ successes should fit within a period, hence $q + 1 + f \geq P$. Therefore, if $P - 1 - q < 0$, $T \leq P/(q + r + 1 + 0)$, otherwise,
\[
T \leq \frac{P}{q + r + 1 + P - 1 - q} = \frac{P}{P + r}.
\]
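The two cases of Equation 9 can be checked numerically with a short Python sketch. The function names are ours; the code simply plugs the lower bound on failures into Equation 8.

```python
def min_failures(P, q):
    """Lower bound f(-) on the number of failures (Lemma 7)."""
    return P - q - 1 if q <= P - 1 else 0


def max_throughput(P, q, r):
    """Upper bound on throughput obtained by using f(-) in Equation 8."""
    return P / (q + r + 1 + min_failures(P, q))


if __name__ == "__main__":
    # High contention (q <= P - 1): the bound reduces to P / (P + r).
    print(max_throughput(P=8, q=3, r=0.5))
    # Low contention (q > P - 1): no failure is needed, bound is P / (q + r + 1).
    print(max_throughput(P=4, q=10, r=0.5))
```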
Remark 3. We notice that if q > P − 1, the upper bound in Equation 9 is actually the same as
the immediate upper bound described in Section III-B1. However, if q ≤ P − 1, Equation 9 refines
the immediate upper bound.
Lemma 8. The number of failures is bounded by
\[
f \leq f^{(+)} = \left\lfloor \frac{1}{2} \left( (P - 1 - q - r) + \sqrt{(P - 1 - q - r)^2 + 4P} \right) \right\rfloor,
\]
and accordingly, the throughput is bounded by
\[
T \geq \frac{P}{q + r + 1 + f^{(+)}}.
\]
Proof: We show that a necessary condition so that an $(f, P)$-cyclic execution, whose lagging time is $\ell$, exists, is $f \times (\ell + r) < P$. According to Property 1, any set of $P$ consecutive successes is a well-formed seed with $P$ threads. Let $\mathcal{S}$ be any of them. As we have $f$ failures before success, Theorem 1 ensures that for all $n \in \llbracket 0, P-1 \rrbracket$, $G^{(f)}_n < 1$. We recall that for all $n \in \llbracket 0, P-1 \rrbracket$, we also have $G^{(P)}_n = \ell + r$.
On the one hand, we have
\[
\begin{aligned}
\sum_{n=0}^{P-1} G^{(f)}_n &= \sum_{n=0}^{P-1} \; \sum_{j=n-f+1}^{n} G^{(1)}_{j \bmod P} \\
&= f \times \sum_{n=0}^{P-1} G^{(1)}_n \\
\sum_{n=0}^{P-1} G^{(f)}_n &= f \times (\ell + r).
\end{aligned}
\]
On the other hand, $\sum_{n=0}^{P-1} G^{(f)}_n < \sum_{n=0}^{P-1} 1 = P$.
Altogether, the necessary condition states that $f \times (\ell + r) < P$, which can be rewritten as $f \times (q + 1 + f - P + r) < P$. The proof is complete since minimizing the throughput is equivalent to maximizing the number of failures.
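The bound of Lemma 8 can be evaluated with the following sketch, which implements the closed form as stated and leaves aside the boundary case where the expression under the floor is exactly an integer. Function names and sample values are illustrative assumptions.

```python
import math


def max_failures(P, q, r):
    """Upper bound f(+) on the number of failures (closed form of Lemma 8)."""
    b = P - 1 - q - r
    return math.floor(0.5 * (b + math.sqrt(b * b + 4 * P)))


def min_throughput(P, q, r):
    """Lower bound on throughput obtained by using f(+) in Equation 8."""
    return P / (q + r + 1 + max_failures(P, q, r))


def satisfies_necessary_condition(P, q, r, f):
    """Necessary condition f * (q + 1 + f - P + r) < P from the proof."""
    return f * (q + 1 + f - P + r) < P


if __name__ == "__main__":
    P, q, r = 8, 3, 0.5  # illustrative values only
    f_plus = max_failures(P, q, r)
    print(f_plus, min_throughput(P, q, r))
    print(satisfies_necessary_condition(P, q, r, f_plus))
```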
Lemma 9. For each of the bounds defined in Lemmas 7 and 8, there exists an (f, P )-cyclic execution
that reaches the bound.
Proof: According to Lemmas 7 and 8, if an (f, P )-cyclic execution exists, then the number of
failures is such that f (-) ≤ f ≤ f (+).
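We show now that this double necessary condition is also sufficient. We consider $f$ such that $f^{(-)} \leq f \leq f^{(+)}$, and build a well-formed seed $\mathcal{S} = (T_i, S_i)_{i \in \llbracket 0, P-1 \rrbracket}$.
For all $n \in \llbracket 0, P-1 \rrbracket$, we define $S_n$ as
\[
S_n = n \times \left( \frac{q + 1 + f - P + r}{P} + 1 \right).
\]
We first show that $f(\mathcal{S}) = f$. By definition, $f(\mathcal{S}) = \max\left(0, \lceil S_{P-1} - S_0 - q - r \rceil\right)$; we have then
\[
\begin{aligned}
f(\mathcal{S}) &= \max\left(0, \left\lceil (P - 1) \times \left( \frac{q + 1 + f - P + r}{P} + 1 \right) - q - r \right\rceil\right) \\
&= \max\left(0, \left\lceil (P - 1 - q - r) + (q + 1 + f - P + r) - \frac{q + 1 + f - P + r}{P} \right\rceil\right) \\
f(\mathcal{S}) &= \max\left(0, \left\lceil f - \frac{q + 1 + f - P + r}{P} \right\rceil\right).
\end{aligned}
\]
Firstly, we know that $q + 1 + f - P \geq 0$, thus if $f = 0$, then the second term of the maximum is not positive, and $f(\mathcal{S}) = 0 = f$. Secondly, if $f > 0$, then according to Lemma 8, $(q + 1 + f - P + r)/P < 1/f \leq 1$. As we also have $(q + 1 + f - P + r)/P \geq 0$, we conclude that $f(\mathcal{S}) = \left\lceil f - \frac{q + 1 + f - P + r}{P} \right\rceil = f$.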
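Additionally, for all $n \in \llbracket 0, P-1 \rrbracket$,
\[
\begin{aligned}
G^{(f)}_n &=
\begin{cases}
S_n - S_{n-f} - f & \text{if } n > f \\
S_n - S_{P+n-f} + 1 + q + r & \text{otherwise}
\end{cases} \\
&=
\begin{cases}
n \times \left( \frac{q+1+f-P+r}{P} + 1 \right) - (n - f) \times \left( \frac{q+1+f-P+r}{P} + 1 \right) - f \\
n \times \left( \frac{q+1+f-P+r}{P} + 1 \right) - (P + n - f) \times \left( \frac{q+1+f-P+r}{P} + 1 \right) + 1 + q + r
\end{cases} \\
&=
\begin{cases}
f \times \frac{q+1+f-P+r}{P} \\
-(P - f) - (q + 1 + f - P + r) + f \times \frac{q+1+f-P+r}{P} + 1 + q + r
\end{cases} \\
G^{(f)}_n &= f \times \frac{w + r}{P}.
\end{aligned}
\]
As $w \geq 0$ and $f \geq 0$, $G^{(f)}_n > 0$. Since $f \leq f^{(+)}$, $G^{(f)}_n < 1$. Theorem 1 implies that $\mathcal{S}$ is a well-formed seed that leads to an $(f, P)$-cyclic execution.
We have shown that for all $f$ such that $f^{(-)} \leq f \leq f^{(+)}$ there exists an $(f, P)$-cyclic execution; in particular there exist an $(f^{(+)}, P)$-cyclic execution and an $(f^{(-)}, P)$-cyclic execution.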
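The construction used in this proof can be replayed numerically. The sketch below builds the evenly spaced seed, computes the $f$th order gaps with the wrap-around expression above, and checks that they lie in [0, 1). Function names are ours, and two points are assumptions made for the sake of a runnable example: the first branch is taken for n >= f so that indices stay in range, and the well-formedness test uses the gap criterion quoted from Theorem 1.

```python
def seed_times(P, q, r, f):
    """Evenly spaced success times S_n used in the proof of Lemma 9."""
    step = (q + 1 + f - P + r) / P + 1
    return [n * step for n in range(P)]


def gap_of_order_f(S, n, f, q, r):
    """f-th order gap of thread n, wrapping around the period when n < f."""
    P = len(S)
    if n >= f:  # assumption: n >= f keeps the index n - f in range
        return S[n] - S[n - f] - f
    return S[n] - S[P + n - f] + 1 + q + r


def gaps_allow_f_failures(P, q, r, f):
    """True if every f-th order gap of the constructed seed is in [0, 1)."""
    S = seed_times(P, q, r, f)
    return all(0 <= gap_of_order_f(S, n, f, q, r) < 1 for n in range(P))


if __name__ == "__main__":
    # Illustrative values only: f = 1 is admissible, f = 3 is not.
    print(gaps_allow_f_failures(P=4, q=2, r=0.5, f=1))
    print(gaps_allow_f_failures(P=4, q=2, r=0.5, f=3))
```

With these values every gap equals f × (q + 1 + f − P + r) / P, which matches the closed form derived above.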
Corollary 1. The highest possible number of wasted repetitions is $\left\lceil \sqrt{P} - 1 \right\rceil$ and is achieved when $P = q + 1$.
Proof:
The highest possible number of wasted repetitions $\tilde{w}(P)$ with $P$ threads is given by
\[
\tilde{w}(P) = f^{(+)} - f^{(-)} = \left\lfloor \frac{1}{2} \left( -a(P) + \sqrt{a(P)^2 + 4P} \right) - f^{(-)} \right\rfloor.
\]
Let $a$ and $h$ be the functions respectively defined as $a(P) = q + 1 - P + r$, which implies $a'(P) = -1$, and $h(P) = \left(-a(P) + \sqrt{a(P)^2 + 4P}\right)/2 - f^{(-)}$, so that $\tilde{w}(P) = \lfloor h(P) \rfloor$.
Let us first assume that $a(P) > 0$. In this case, $q \geq P - 1$, hence $f^{(-)} = 0$. We have
\[
2h'(P) = 1 + \frac{-2a(P) + 4}{2\sqrt{a(P)^2 + 4P}} = 2 \times \frac{2 - a(P) + \sqrt{a(P)^2 + 4P}}{2\sqrt{a(P)^2 + 4P}}.
\]
Therefore, $h'(P)$ is negative if and only if $\sqrt{a(P)^2 + 4P} < a(P) - 2$. It cannot be true if $a(P) < 2$. If $a(P) \geq 2$, then the previous inequality is equivalent to $a(P)^2 + 4P < a(P)^2 - 4a(P) + 4$, which can be rewritten as $q + 1 + r < 1$, which is absurd. We have shown that $h$ is increasing on $]0, q + 1]$.
Let us now assume that $a(P) \leq 0$. In this case, $q \leq P - 1$, hence $f^{(-)} = P - q - 1$, and $h(P) = \left(a(P) + \sqrt{a(P)^2 + 4P}\right)/2 - r$. Assuming $h'(P)$ to be positive leads to the same absurd inequality $q + 1 + r < 1$, which proves that $h$ is decreasing on $[q + 2, +\infty[$.
Hence, the maximum number of wasted repetitions is achieved at $P = q + 1$ or $P = q + 2$. Since
\[
h(q + 1) = \frac{1}{2} \left( -r + \sqrt{r^2 + 4P} \right) > \frac{1}{2} \left( -(r + 1) + \sqrt{r^2 + 4P} \right) = h(q + 2),
\]
the maximum number of wasted repetitions is $\tilde{w}(q + 1)$. In addition,
\[
\begin{gathered}
\frac{1}{2} \left( -r + \sqrt{4P} \right) < h(q + 1) < \frac{1}{2} \left( -r + \sqrt{r^2} + \sqrt{4P} \right) \\
\sqrt{P} - \frac{r}{2} < h(q + 1) < \sqrt{P} \\
\sqrt{P} - 1 \leq h(q + 1) < \sqrt{P}
\end{gathered}
\]
We conclude that the maximum number of wasted repetitions is $\left\lceil \sqrt{P} - 1 \right\rceil$.
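To get a feeling for Corollary 1, the following sketch scans the number of threads for a fixed amount of parallel work and reports the largest difference between the two bounds on failures. The function name and the chosen values of q and r are illustrative assumptions; the code only evaluates the closed forms of Lemmas 7 and 8.

```python
import math


def wasted_repetitions(P, q, r):
    """Difference f(+) - f(-) between the bounds of Lemmas 8 and 7."""
    a = q + 1 - P + r
    f_plus = math.floor(0.5 * (-a + math.sqrt(a * a + 4 * P)))
    f_minus = max(0, P - q - 1)
    return f_plus - f_minus


if __name__ == "__main__":
    q, r = 8, 0.25  # illustrative values only
    values = {P: wasted_repetitions(P, q, r) for P in range(1, 41)}
    best = max(values.values())
    print(best, [P for P, w in values.items() if w == best])
```

For these values the maximum is reached, among other points, at P = q + 1 = 9, where it equals ceil(sqrt(9) - 1) = 2.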
V. Expansion and Complete Throughput Estimation
A. Expansion
Interference of threads does not only lead to logical conflicts but also to hardware conflicts which
impact the performance significantly. We model the behavior of the cache coherency protocols which
determine the interaction of overlapping Reads and CASs. By taking MESIF [GH09] as basis, we
come up with the following assumptions. When executing an atomic CAS, the core gets the cache line
in exclusive state and does not forward it to any other requesting core until the instruction is retired.
Therefore, requests stall for the release of the cache line which implies serialization. On the other
hand, ongoing Reads can overlap with other operations. As a result, a CAS introduces expansion
only to overlapping Read and CAS operations that start after it, as illustrated in Figure 4. As a
remark, we ignore memory bandwidth issues which are negligible for our study.
Furthermore, we assume that Reads that are executed just after a CAS do not experience expansion (as the thread already owns the data), which takes effect at the beginning of a retry following a failing attempt. Thus, read expansions need only to be considered before the 0th retry. In this sense, read expansion can be moved to the parallel section and calculated in the same way as CAS expansion is calculated.
To estimate expansion, we consider the delay that a thread can introduce, provided that there is
already a given number of threads in the retry loop. The starting point of each CAS is a random
variable which is distributed uniformly within an expanded retry. The cost function d provides the
amount of delay that the additional thread introduces, depending on the point where the starting
point of its CAS hits. By using this cost function we can formulate the expansion increase that each
new thread introduces and derive the differential equation below to calculate the expansion of a
CAS.
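Lemma 10. The expansion of a CAS operation is the solution of the following system of equations:
\[
\begin{cases}
e'(P_{rl}) = \mathit{cc} \times \dfrac{\frac{\mathit{cc}}{2} + e(P_{rl})}{\mathit{rc} + \mathit{cw} + \mathit{cc} + e(P_{rl})} \\[2ex]
e\left(P^{(0)}_{rl}\right) = 0,
\end{cases}
\]
where $P^{(0)}_{rl}$ is the point where expansion begins.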
Proof:
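We compute $e(P_{rl} + h)$, where $h \leq 1$, by assuming that there are already $P_{rl}$ threads in the retry loop, and that a new thread attempts to CAS during the retry, with probability $h$.
\[
\begin{aligned}
e(P_{rl} + h) &= e(P_{rl}) + h \times \int_0^{\mathit{rlw}^{(+)}} \frac{d(t)}{\mathit{rlw}^{(+)}} \, dt \\
&= e(P_{rl}) + h \times \Bigg( \int_0^{\mathit{rc}+\mathit{cw}-\mathit{cc}} \frac{d(t)}{\mathit{rlw}^{(+)}} \, dt + \int_{\mathit{rc}+\mathit{cw}-\mathit{cc}}^{\mathit{rc}+\mathit{cw}} \frac{d(t)}{\mathit{rlw}^{(+)}} \, dt \\
&\qquad\qquad + \int_{\mathit{rc}+\mathit{cw}}^{\mathit{rc}+\mathit{cw}+e(P_{rl})} \frac{d(t)}{\mathit{rlw}^{(+)}} \, dt + \int_{\mathit{rc}+\mathit{cw}+e(P_{rl})}^{\mathit{rlw}^{(+)}} \frac{d(t)}{\mathit{rlw}^{(+)}} \, dt \Bigg) \\
&= e(P_{rl}) + h \times \left( \int_{\mathit{rc}+\mathit{cw}-\mathit{cc}}^{\mathit{rc}+\mathit{cw}} \frac{t}{\mathit{rlw}^{(+)}} \, dt + \int_{\mathit{rc}+\mathit{cw}}^{\mathit{rc}+\mathit{cw}+e(P_{rl})} \frac{\mathit{cc}}{\mathit{rlw}^{(+)}} \, dt \right) \\
e(P_{rl} + h) &= e(P_{rl}) + h \times \frac{\frac{\mathit{cc}^2}{2} + e(P_{rl}) \times \mathit{cc}}{\mathit{rlw}^{(+)}}
\end{aligned}
\]
This leads to
\[
\frac{e(P_{rl} + h) - e(P_{rl})}{h} = \frac{\frac{\mathit{cc}^2}{2} + e(P_{rl}) \times \mathit{cc}}{\mathit{rlw}^{(+)}}.
\]
When making $h$ tend to 0, we finally obtain
\[
e'(P_{rl}) = \mathit{cc} \times \frac{\frac{\mathit{cc}}{2} + e(P_{rl})}{\mathit{rc} + \mathit{cw} + \mathit{cc} + e(P_{rl})}.
\]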
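The differential equation of Lemma 10 has no dependence on $P_{rl}$ other than through $e$, so it can be integrated numerically with any standard scheme. The sketch below uses a classical fourth-order Runge-Kutta step in plain Python; the function name, the step count and the latency values are our assumptions for illustration, not the method or parameters used in the experiments.

```python
def cas_expansion(p_rl, p_start, rc, cw, cc, steps=1000):
    """Integrate e'(P) = cc * (cc/2 + e) / (rc + cw + cc + e), e(p_start) = 0.

    p_rl: number of threads in the retry loop at which e is evaluated,
    p_start: point where expansion begins, rc/cw/cc: Read, critical work
    and CAS costs in the same time unit (e.g. cycles).
    """
    if p_rl <= p_start:
        return 0.0  # no expansion before the starting point

    def slope(e):
        return cc * (cc / 2 + e) / (rc + cw + cc + e)

    h = (p_rl - p_start) / steps
    e = 0.0
    for _ in range(steps):
        # Classical RK4 step; the right-hand side depends on e only.
        k1 = slope(e)
        k2 = slope(e + 0.5 * h * k1)
        k3 = slope(e + 0.5 * h * k2)
        k4 = slope(e + h * k3)
        e += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return e


if __name__ == "__main__":
    # Hypothetical costs in cycles, for illustration only.
    for p in (1, 2, 3, 4):
        print(p, cas_expansion(p, p_start=1, rc=10, cw=20, cc=30))
```

Since the slope is always below cc, the expansion grows by less than one CAS latency per additional thread in the retry loop.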
B. Throughput Estimate
There remains to combine hardware and logical conflicts in order to obtain the final upper and
lower bounds on throughput. We are given as an input an expected number of threads Prl inside
the retry loop. We firstly compute the expansion accordingly, by solving numerically the differential
equation of Lemma 10. As explained in the previous subsection, we have pw(+) = pw + e, and
rlw(+) = rc + cw + e + cc. We can then compute q and r, that are the inputs (together with the
total number of threads P ) of the method described in Section IV. Assuming that the initialization
times of the threads are spaced enough, the execution will superimpose an (f, P )-cyclic execution.
Thanks to Lemma 6, we can compute the average number of threads inside the retry loop, which we
denote by hf (Prl). A posteriori, the solution is consistent if this average number of threads inside the
retry loop hf (Prl) is equal to the expected number of threads Prl that has been given as an input.
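The step that turns an expansion value into the inputs q and r of Section IV can be sketched as follows. The function name is hypothetical and the latencies are illustrative; the code only applies the relations pw(+) = pw + e, rlw(+) = rc + cw + e + cc and pw(+) = (q + r) × rlw(+) stated in the text.

```python
import math


def expanded_parameters(pw, rc, cw, cc, e):
    """Return (q, r) such that pw + e = (q + r) * (rc + cw + e + cc).

    q is a non-negative integer and 0 <= r < 1.
    """
    pw_plus = pw + e             # expanded parallel work
    rlw_plus = rc + cw + e + cc  # expanded size of one retry
    ratio = pw_plus / rlw_plus
    q = math.floor(ratio)
    return q, ratio - q


if __name__ == "__main__":
    # Illustrative values only (in cycles).
    print(expanded_parameters(pw=250, rc=20, cw=30, cc=50, e=0))
    print(expanded_parameters(pw=250, rc=20, cw=30, cc=50, e=50))
```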
Several (f, P )-cyclic executions belong to the domain of the possible outcomes, but we are
interested in upper and lower bounds on the number of failures f . We can compute them through
Lemmas 7 and 8, along with their corresponding throughput and average number of threads inside
the retry loop. We note by $h^{(+)}(P_{rl})$ and $h^{(-)}(P_{rl})$ the average number of threads for the lowest number of failures and highest one, respectively. Our aim is finally to find $P^{(-)}_{rl}$ and $P^{(+)}_{rl}$, such that $h^{(+)}(P^{(+)}_{rl}) = P^{(+)}_{rl}$ and $h^{(-)}(P^{(-)}_{rl}) = P^{(-)}_{rl}$. If several solutions exist, then we want to keep the smallest, since the retry loop stops expanding when a stable state is reached.
Note that we also need to provide the point where the expansion begins. It begins when we start to
have failures, while reducing the parallel section. Thus this point is (2P−1)rlw(-) (resp. (P−1)rlw(-))
for the lower (resp. upper) bound on the throughput.
Theorem 3. Let $(x_n)$ be the sequence defined recursively by $x_0 = 0$ and $x_{n+1} = h^{(+)}(x_n)$. If $\mathit{pw} \geq \mathit{rc} + \mathit{cw} + \mathit{cc}$, then
\[
P^{(+)}_{rl} = \lim_{n \to +\infty} x_n.
\]
Proof: First of all, the average number of threads belongs to $]0, P[$, thus for all $x \in [0, P]$, $0 < h^{(+)}(x) < P$. In particular, we have $h^{(+)}(0) > 0$, and $h^{(+)}(P) < P$, which proves that there exists at least one fixed point for $h^{(+)}$.
In addition, we show that $h^{(+)}$ is a non-decreasing function. According to Lemma 6,
\[
h^{(+)}(P_{rl}) = P \times \frac{1 + f^{(-)}}{q + r + f^{(-)} + 1},
\]
where all variables except $P$ depend actually on $P_{rl}$. We have
\[
q = \left\lfloor \frac{\mathit{pw} + e}{\mathit{rlw}^{(-)} + e} \right\rfloor \quad \text{and} \quad r = \frac{\mathit{pw} + e}{\mathit{rlw}^{(-)} + e} - q,
\]
hence, if $\mathit{pw} \geq \mathit{rlw}^{(-)}$, $q$ and $r$ are non-increasing in $e$, and $e$ is itself non-decreasing with $P_{rl}$. Since $f^{(-)}$ is non-increasing as a function of $q$, we have shown that if $\mathit{pw} \geq \mathit{rlw}^{(-)}$, $h^{(+)}$ is a non-decreasing function.
Finally, the proof is completed by the theorem of Knaster-Tarski.
The same line of reasoning holds for $h^{(-)}$ as well. As a remark, we point out that when $\mathit{pw} < \mathit{rlw}^{(-)}$, we scan the interval of solutions, and have no guarantee that the solution is the smallest one; still, this corresponds to very extreme cases.
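The iteration of Theorem 3 is a plain fixed-point iteration started from 0. The sketch below is generic: it takes any function h and iterates until two successive values are close. The stopping rule, the tolerance and the toy function used in the example are our assumptions; in the actual model, evaluating h requires solving the differential equation of Lemma 10 and applying Lemmas 6 to 8.

```python
def iterate_to_fixed_point(h, x0=0.0, tol=1e-9, max_iter=10_000):
    """Iterate x_{n+1} = h(x_n) from x0 until successive values differ by <= tol."""
    x = x0
    for _ in range(max_iter):
        nxt = h(x)
        if abs(nxt - x) <= tol:
            return nxt
        x = nxt
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Toy non-decreasing function standing in for h(+); its fixed point is 2.
    toy_h = lambda x: 1.0 + x / 2.0
    print(iterate_to_fixed_point(toy_h))
```

Starting from 0 with a non-decreasing function yields a non-decreasing sequence, which is why the limit, when it exists, is approached from below.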
C. Several Retry Loops
We consider here a lock-free algorithm that, instead of being a loop over one parallel section and
one retry loop, is composed of a loop over a sequence of alternating parallel sections and retry loops.
We show that this algorithm is equivalent to an algorithm with only one parallel section and one
retry loop, by proving the intuition that the longest retry loop is the only one that fails and hence
expands.
1) Problem Formulation: In this subsection, we consider an execution such that each spawned
thread runs Procedure Combined in Figure 9. Each thread executes a linear combination of S
independent retry loops, i.e. operating on separate variables, interleaved with parallel sections.
We note now as $\mathit{rlw}^{(+)}_i$ and $\mathit{pw}^{(+)}_i$ the size of a retry of the $i$th retry loop and the size of the $i$th parallel section, respectively, for each $i \in \llbracket 1, S \rrbracket$. As previously, $q_i$ and $r_i$ are defined such that $\mathit{pw}^{(+)}_i = (q_i + r_i) \times \mathit{rlw}^{(+)}_i$, where $q_i$ is a non-negative integer and $r_i$ is smaller than 1.
Procedure Combined
    Initialization();
    while ! done do
        for i ← 1 to S do
            Parallel_Work(i);
            while ! success do
                current ← Read(AP[i]);
                new ← Critical_Work(i, current);
                success ← CAS(AP[i], current, new);
Figure 9: Thread procedure with several retry loops
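For readers who want to experiment with the structure of Procedure Combined, here is a minimal Python emulation. Python offers no hardware CAS, so the `AccessPoint` class below emulates one with a lock; this class, the `combined` and `run_demo` functions and the choice of an increment as critical work are all assumptions made for illustration and say nothing about the performance of a real lock-free implementation.

```python
import threading


class AccessPoint:
    """Shared cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def combined(access_points, rounds, parallel_work, critical_work):
    """One thread of Procedure Combined, run for a fixed number of rounds."""
    for _ in range(rounds):
        for i, ap in enumerate(access_points):
            parallel_work(i)
            success = False
            while not success:  # retry loop on the i-th access point
                current = ap.read()
                new = critical_work(i, current)
                success = ap.cas(current, new)


def run_demo(num_threads=4, num_loops=3, rounds=200):
    aps = [AccessPoint() for _ in range(num_loops)]
    threads = [
        threading.Thread(
            target=combined,
            args=(aps, rounds, lambda i: None, lambda i, cur: cur + 1),
        )
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [ap.read() for ap in aps]


if __name__ == "__main__":
    print(run_demo())
```

Each successful CAS increments one access point exactly once, so every cell ends at the number of threads times the number of rounds, whatever the interleaving.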
The Procedure Combined executes the retry loops and parallel sections in a cyclic fashion, so we
can normalize the writing of this procedure by assuming that a retry of the 1st retry loop is the
longest one. More precisely, we consider the initial algorithm, and we define i0 as
\[
i_0 = \min \operatorname*{argmax}_{i \in \llbracket 1, S \rrbracket} \mathit{rlw}^{(+)}_i.
\]
We then renumber the retry loops such that the new ordering is i0, . . . , S, 1, . . . , i0 − 1, and we add
in Initialization the first parallel sections and retry loops on access points from 1 to i0 — according
to the initial ordering.
One success at the system level is defined as one success of the last CAS, and the throughput is
defined accordingly. We note that in steady-state, all retry loops have the same throughput, so the
throughput can be computed from the throughput of the 1st retry loop instead.
2) Wasted Retries:
Lemma 11. Unsuccessful retry loops can only occur in the 1st retry loop.
Proof:
We note $(t_n)_{n \in [1, +\infty[}$ the sequence of the thread numbers that succeed in the 1st retry loop, and $(s_n)_{n \in [1, +\infty[}$ the sequence of the corresponding times where they exit the retry loop. We notice that by construction, for all $n \in [1, +\infty[$, $s_n < s_{n+1}$. Let, for $i \in \llbracket 2, S \rrbracket$ and $n \in [1, +\infty[$, $(P_{i,n})$ be the following property: for all $i' \in \llbracket 2, i \rrbracket$, and for all $n' \in \llbracket 1, n \rrbracket$, the thread $T_{t_{n'}}$ succeeds in the $i'$th retry loop at its first attempt.
We assume that for a given $(i, n)$, $(P_{i+1,n})$ and $(P_{i,n+1})$ are true, and show that $(P_{i+1,n+1})$ is true. As the threads $T_{t_n}$ and $T_{t_{n+1}}$ do not have any failure in the first $i$ retry loops, their entrance time in the $(i+1)$th retry loop is given by
\[
s_n + \sum_{i'=1}^{i} \left( \mathit{rlw}^{(+)}_{i'} + \mathit{pw}^{(+)}_{i'} \right) + \mathit{pw}^{(+)}_{i+1} = X_1
\quad \text{and} \quad
s_{n+1} + \sum_{i'=1}^{i} \left( \mathit{rlw}^{(+)}_{i'} + \mathit{pw}^{(+)}_{i'} \right) + \mathit{pw}^{(+)}_{i+1} = X_2,
\]
respectively. Thread $T_{t_n}$ does not fail in the $(i+1)$th retry loop, hence exits at
\[
X_1 + \mathit{rlw}^{(+)}_{i+1} < X_1 + \mathit{rlw}^{(+)}_{1} = s_n + X_2 - s_{n+1} < X_2.
\]
As the previous threads $T_{n-1}, \dots, T_1$ exit the $i$th retry loop before $T_n$, and the next threads $T_{n'}$, where $n' > n + 1$, enter this retry loop after $T_{n+1}$, this implies that the thread $T_{t_{n+1}}$ succeeds in the $(i+1)$th retry loop at its first attempt, and $(P_{i+1,n+1})$ is true.
Regarding the first thread that succeeds in the first retry loop, we know that it succeeds in every retry loop since there is no other thread to compete with. Therefore, for all $i \in \llbracket 2, S \rrbracket$, $(P_{i,1})$ is true. Then we show by induction that all $(P_{2,n})$ are true, then all $(P_{3,n})$, etc., until all $(P_{S,n})$, which concludes the proof.
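Theorem 4. The multi-retry loop Procedure Combined is equivalent to the Procedure AbstractAlgorithm, where
\[
\mathit{pw}^{(+)} = \mathit{pw}^{(+)}_1 + \sum_{i=2}^{S} \left( \mathit{pw}^{(+)}_i + \mathit{rlw}^{(+)}_i \right)
\quad \text{and} \quad
\mathit{rlw}^{(+)} = \mathit{rlw}^{(+)}_1.
\]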
Proof: According to Lemma 11, there is no failure in any retry loop other than the first one; therefore,
all the other retry loops have a constant duration, and can thus be considered as parallel sections.
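The reduction of Theorem 4 amounts to a simple sum, sketched below. The function name and the list-based interface are ours; the lists are assumed to be already normalized so that the first retry loop is the longest one, as described in the problem formulation.

```python
def equivalent_single_loop(pw, rlw):
    """Return (pw_eq, rlw_eq) of the equivalent single-retry-loop procedure.

    pw[i] and rlw[i] are the sizes of the (i+1)-th parallel section and of
    one retry of the (i+1)-th retry loop; rlw[0] must be the longest retry.
    """
    if not pw or len(pw) != len(rlw):
        raise ValueError("pw and rlw must be non-empty and of equal length")
    if rlw[0] != max(rlw):
        raise ValueError("renumber the loops so that the first retry is the longest")
    # All retry loops but the first never fail, so they count as parallel work.
    pw_eq = pw[0] + sum(p + l for p, l in zip(pw[1:], rlw[1:]))
    return pw_eq, rlw[0]


if __name__ == "__main__":
    # Illustrative sizes only.
    print(equivalent_single_loop(pw=[100, 50, 30], rlw=[40, 20, 10]))
```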
3) Expansion: The expansion in the retry loop starts as threads fail inside this retry loop. When
threads are launched, there is no expansion, and Lemma 11 implies that if threads fail, it should be
inside the first retry loop, because it is the longest one. As a result, there will be some stall time
in the memory accesses of this first retry loop, i.e. expansion, and it will get even longer. Failures
will thus still occur in the first retry loop: there is a positive feedback on the expansion of the first
retry loop that keeps this first retry loop as the longest one among all retry loops. Therefore, in
accordance to Theorem 4, we can compute the expansion by considering the equivalent single-retry
loop procedure described in the theorem.
VI. Experimental Evaluation
We validate our model and analysis framework through successive steps, from synthetic tests,
capturing a wide range of possible abstract algorithmic designs, to several reference implementations
of extensively studied lock-free data structure designs that include cases with non-constant parallel
section and retry loop.
A. Setting
We have conducted experiments on an Intel ccNUMA workstation system. The system is composed
of two sockets, each equipped with an Intel Xeon E5-2687W v2 CPU with a frequency range of 1.2-3.4 GHz.
The physical cores have private L1 and L2 caches and they share an L3 cache, which is 25 MB. In a
socket, the ring interconnect provides L3 cache accesses and core-to-core communication. Due to
the bi-directionality of the ring interconnect, uncontended latencies for intra-socket communication
between cores do not show significant variability.
Our model assumes uniformity in the CAS and Read latencies on the shared cache line. Thus,
threads are pinned to a single socket to minimize non-uniformity in Read and CAS latencies. In the
experiments, we vary the number of threads between 4 and 8 since the maximum number of threads
that can be used in the experiments is bounded by the number of physical cores that reside in one
socket.
In all figures, the y-axis provides the throughput, which is the number of successful operations
completed per millisecond. Parallel work is represented on the x-axis in cycles. The graphs contain the
high and low estimates (see Section IV), corresponding to the lower and upper bound on the wasted
retries, respectively, and an additional curve that shows their average.
As mentioned before, the latencies of CAS and Read are parameters of our model. We used the
methodology described in [DGT13] to measure latencies of these operations in a benchmark program
by using two threads that are pinned to the same socket. The aim is to bring the cache line into the
state used in our model. Our assumption is that the Read is conducted on an invalid line. For CAS,
the state of the cache line could be exclusive, forward, shared or invalid. Regardless of the state
of the cache line, CAS requests it for ownership, which compels invalidation in other cores, which in
turn incurs a two-way communication and a memory fence afterwards to assure atomicity. Thus, the
latency of CAS does not show significant variability with respect to the state of the cache line, as
also revealed in our latency benchmarks.
As for the computation cost, the work inside the parallel section is implemented by a dummy
for-loop of Pause instructions.
B. Synthetic Tests
1) Single retry loop: For the evaluation of our model, we first create synthetic tests that emulate
different design patterns of lock-free data structures (value of cw) and different application contexts
(value of pw). As described in the previous subsection, in the Procedure AbstractAlgorithm, the
amount of work in both the parallel section and the retry loop is implemented as dummy loops,
whose costs are adjusted through the number of iterations in the loop.
Figure 10: Synthetic program. [Grid of panels: cw = 50, 100, 200, 600, 1600 (rows) by threads = 4, 6, 8 (columns). x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Curves: Low, High, Average, Real.]
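The structure of such a synthetic test can be sketched as follows. This is only an illustration under explicit assumptions: Python has no hardware CAS, so the class SimAtomic emulates one with a lock, and dummy_work stands in for the dummy loop of Pause instructions. All names are hypothetical, and the timing behavior of this sketch says nothing about the measurements reported here.

```python
import threading


class SimAtomic:
    """Integer cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def dummy_work(iterations):
    """Stand-in for the dummy for-loop; its cost grows with `iterations`."""
    for _ in range(iterations):
        pass


def worker(shared, pw, cw, operations, stats):
    successes = fails = 0
    for _ in range(operations):
        dummy_work(pw)                             # parallel section
        while True:                                # retry loop
            current = shared.read()                # Read
            dummy_work(cw)                         # critical work
            if shared.cas(current, current + 1):   # CAS
                successes += 1
                break
            fails += 1                             # wasted retry
    stats.append((successes, fails))               # list.append is atomic in CPython


def run_synthetic(threads=4, pw=200, cw=50, operations=500):
    shared = SimAtomic(0)
    stats = []
    pool = [threading.Thread(target=worker,
                             args=(shared, pw, cw, operations, stats))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    total_successes = sum(s for s, _ in stats)
    total_fails = sum(f for _, f in stats)
    return shared.read(), total_successes, total_fails


final_value, successes, fails = run_synthetic()
print(final_value, successes, fails)
```

Every success increments the shared cell exactly once, so the final value equals the number of successful operations, while the number of fails depends on the interleaving.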
Generally speaking, in Figure 10, we observe two main behaviors: when pw is high, the data
structure is not contended, and threads can operate without failure. When pw is low, the data
structure is contended, and depending on the size of cw (that drives the expansion) a steep decrease
in throughput or just a roughly constant bound on the performance is observed.
The position of the experimental curve between the high and low estimates depends on cw. It can
be observed that the experimental curve mostly tends upwards as cw gets smaller, possibly because
the serialization of the CASs helps the synchronization of the threads.
Another interesting fact is the waves appearing on the experimental curve, especially when the
number of threads is low or the critical work is large. This behavior originates from the variation
of r as the parallel work changes, a fact that is captured by our analysis.
Figure 11: Multiple retry loops with 8 threads. [Grid of panels: cw1 = 50, 200, 400, 1000 (columns) by cw2 = 50, 400, 1000 (rows). x-axis: Parallel Work + Small Retry Loop (cycles). Curves: Norm. Success, Fails RL1/Success, Fails RL2/Success, Total Fails / Success, Low, High, Average.]
2) Several retry loops: We have created experiments by combining several retry loops, each
operating on an independent variable which is aligned to a cache line. In Figure 11, results are
compared with the model for the single retry loop case, where the single retry loop is equal to the
longest retry loop, while the other retry loops are part of the parallel section. The distribution of
fails in the retry loops is illustrated and all throughput curves are normalized by a factor of 175
(to be easily seen in the same graph). Fails per success values are not normalized and a success is
obtained after completing all retry loops.
C. Treiber’s Stack
The lock-free stack by Treiber [Tre86] is one of the most studied efficient data structures. Pop and
Push both contain a retry loop, such that each retry starts with a Read and ends with a CAS on the
shared top pointer. In order to validate our model, we start by using Pops. From a stack which is
initialized with 50 million elements, threads continuously pop elements for a given amount of time.
We count the total number of pop operations per millisecond. Each Pop first reads the top pointer
and gets the next pointer of the element to obtain the address of the second element in the stack,
before attempting to CAS with the address of the second element. The access to the next pointer of
the first element occurs in between the Read and the CAS. Thus, it represents the work in cw. This
memory access can possibly introduce a costly cache miss depending on the locality of the popped
element.
Figure 12: Pop on Treiber’s stack. [Panels: cw = 50, 300, 600, 900, 1200, 1500, all with 6 threads. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Curves: Low, High, Average, Real.]
To validate our model with different cw values, we make use of this costly cache miss possibility.
We allocate a contiguous chunk of memory and align each element to a cache line. Then, we initialize
the stack by pushing elements from contiguous memory with either a single stride or a large stride,
the latter chosen to defeat the prefetcher. When we measure the latency of cw in Pop for the single
and large stride cases, we obtain approximately 50 and 300 cycles, respectively. As a remark, 300 cycles is the
cost of an L3 miss in our system when it is serviced from the local main memory module. To create
more test cases with larger cw, we extended the stack implementation to pop multiple elements with
a single operation. Thus, each access to the next element could introduce an additional L3 cache
miss while popping multiple elements. By doing so, we created cases in which each thread pops 2,
3, etc. elements, and cw goes to 600, 900, etc. cycles, respectively. Figure 12 compares the
experimental results from Treiber’s stack with our model.
As a remark, we did not implement memory reclamation for our experiments, but one can
implement a stack that allows pop and push of multiple elements with small modifications using
hazard pointers [Mic04]. Pushing can be implemented in the same way as in the single element case. A
Pop requires some modifications for memory reclamation. It can be implemented by making use of
hazard pointers just by adding the address of the next element to the hazard list before jumping
to it. Also, the validity of the top pointer should be checked after adding the pointer to the hazard list
to make sure that other threads are aware of the newly added hazard pointer. By repeating this
process, a thread can jump through multiple elements and pop all of them with a CAS at the end.
Algorithm 1: Multiple Pop
Pop (multiple)
    while true do
        t = Read(top);
        for multiple do
            if t = NULL then
                return EMPTY;
            hp* = t;
            if top != t then
                break;
            hp++;
            next = t.next;
        if CAS(&top, t, next) then
            break;
    RetireNodes (t, multiple);
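The multiple-element Pop described above can be sketched in Python as follows. This is a simplified illustration, not the code used in the experiments: SimAtomicRef emulates CAS with a lock, Python's garbage collector replaces hazard pointers and node retirement, and pop_multiple returns fewer elements when the stack is shorter than requested. All class and method names are hypothetical.

```python
import threading


class SimAtomicRef:
    """Reference cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, ref=None):
        self._ref = ref
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._ref

    def cas(self, expected, new):
        with self._lock:
            if self._ref is expected:
                self._ref = new
                return True
            return False


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class TreiberStack:
    def __init__(self):
        self.top = SimAtomicRef(None)

    def push(self, value):
        node = Node(value)
        while True:                        # retry loop
            old_top = self.top.read()      # Read
            node.next = old_top
            if self.top.cas(old_top, node):
                return

    def pop_multiple(self, count):
        """Pop up to `count` elements with one CAS; return their values."""
        while True:                        # retry loop
            first = self.top.read()        # Read
            if first is None:
                return []                  # empty stack
            values = []
            cursor = first
            # Critical work: each hop to a next pointer may miss in the cache.
            while cursor is not None and len(values) < count:
                values.append(cursor.value)
                cursor = cursor.next
            # One CAS swings top past all the nodes collected above.
            if self.top.cas(first, cursor):
                return values


stack = TreiberStack()
for i in range(5):
    stack.push(i)
print(stack.pop_multiple(2))
print(stack.pop_multiple(10))
```

Because nodes are never reused and their next field is not modified after a successful push, a CAS that still finds the node read at the start of the retry guarantees that the traversed chain is unchanged. A native implementation cannot rely on this and needs a reclamation scheme such as hazard pointers.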
D. Shared Counter
In [DLM13], the authors have implemented “scalable statistics counters” relying on the following
idea: when contention is low, the implementation is a regular concurrent counter with a CAS; when
the counter starts to be contended, it switches to a statistical implementation, where the counter is
actually incremented less frequently, but by a higher value. One key point of this algorithm is the
switch point, which is decided based on the number of failed increments; our model can be used by
providing the peak point of performance of the regular counter implementation as the switch point.
We have then implemented a shared counter which is basically a Fetch-and-Increment using a CAS,
and compared it with our analysis. The result is illustrated in Figure 13, and shows that the parallel
section size corresponding to the peak point is correctly estimated using our analysis.
E. DeleteMin in Priority List
We have applied our model to DeleteMin of the skiplist-based priority queue designed in [LJ13].
DeleteMin traverses the list from the beginning of the lowest level, finds the first node that is not
logically deleted, and tries to delete it by marking. If the operation does not succeed, it continues
with the next node. Physical removal is done in batches when reaching a threshold on the number of
deleted prefixes, and is followed by a restructuring of the list by updating the higher level pointers,
which is conducted by the thread that is successful in redirecting the head to the node deleted by
itself.
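The logical deletion step described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions: only the lowest level is modeled as a plain sorted linked list, the deletion mark is kept in a per-node flag with an emulated CAS, and batched physical removal and restructuring are omitted. The names are hypothetical and do not reflect the data layout of [LJ13].

```python
import threading


class SimAtomicFlag:
    """Boolean mark with an emulated CAS(False -> True); a lock stands in for the primitive."""

    def __init__(self):
        self._marked = False
        self._lock = threading.Lock()

    def is_marked(self):
        with self._lock:
            return self._marked

    def try_mark(self):
        with self._lock:
            if self._marked:
                return False       # another thread deleted this node first
            self._marked = True
            return True


class ListNode:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self.deleted = SimAtomicFlag()


def build_list(keys):
    """Build the lowest-level list in increasing key order."""
    head = None
    for key in sorted(keys, reverse=True):
        head = ListNode(key, head)
    return head


def delete_min(head):
    """Return (key, links_traversed); key is None if every node is already marked."""
    node = head
    links = 0
    while node is not None:
        # Try to logically delete the first node that is not marked yet.
        if not node.deleted.is_marked() and node.deleted.try_mark():
            return node.key, links
        # Either already deleted or the mark attempt failed: continue with the next node.
        node = node.next
        links += 1
    return None, links


head = build_list([5, 1, 3])
print(delete_min(head))
print(delete_min(head))
```

In this sketch the number of traversed links grows with the prefix of deleted nodes, which is the part of the operation that batched physical removal keeps bounded.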
We consider the last link traversal before the logical deletion as critical work, as it continues
with the next node in case of failure. The rest of the traversal is attributed to the parallel section
as the threads can proceed concurrently without interference. We measured the average cost of a
traversal under low contention for each number of threads, since traversal becomes expensive with
more threads. In addition, the average cost of restructuring is also included in the parallel section since
it is executed infrequently by a single thread.
Figure 13: Increment on a shared counter. [Panels: (a) 4 threads, (b) 6 threads, (c) 8 threads, all with cw = 0. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Curves: Low, High, Average, Real.]
We initialize the priority queue with a large set of elements. As illustrated in Figure 14, the smallest
pw value is not zero as the average cost of traversal and restructuring is intrinsically included. The
peak point is in the estimated place but the curve does not go down sharply under high contention.
This presumably occurs as the traversal might require more than one step (link access) after a failed
attempt, which creates a back-off effect.
F. Enqueue-Dequeue on a Queue
In order to demonstrate the validity of the model with several retry loops, and that the results
cover a wider spectrum of applications and designs than the ones we focused on in our model, we studied
the following setting: the threads share a queue, and each thread enqueues an element, executes the
parallel section, dequeues an element, and reiterates. We consider the queue implementation by
Michael and Scott [MS96], which is usually viewed as the reference among lock-free
queue implementations.
Dequeue operations fit immediately into our model but Enqueue operations need an adjustment due
to the helping mechanism. Note that without this helping mechanism, a simple queue implementation
would fit directly, but we also want to show that the model is malleable, i.e. the fundamental behavior
remains unchanged even if we divert slightly from the initial assumptions. We consider an equivalent
execution that catches up with the model, and use it to approximate the performance of the actual
execution of Enqueue.
Figure 14: DeleteMin on a priority list. [Panels: (a) 4 threads, (b) 6 threads, (c) 8 threads, all with cw = 50. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Curves: Low, High, Average, Real.]
Enqueue is composed of two steps. Firstly, the new node is attached to the last node of the queue
via a CAS, that we denote by CASA, leading to a transient state. Secondly, the tail is redirected to
point to the new node via another CAS, that we denote by CASB, which brings back the queue into
a steady state.
A new Enqueue cannot proceed before the two steps of the previous successful Enqueue are completed. The first
step is the linearization point of the operation and the second step could be conducted by a different
thread through the helping mechanism. In order to start a new Enqueue, concurrent Enqueues help
the completion of the second step of the last success if they find the queue in the transient state.
Alternatively, they try to attach their node to the queue if the queue is in the steady state at the
instant of the check. This process continues until they manage to attach their node to the queue via a
retry loop in which the state is checked and the corresponding CAS is executed.
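The two-step Enqueue with helping can be sketched as follows. This is an illustrative Python rendering of the description above under stated assumptions: CAS is emulated with a lock in SimAtomicRef, memory reclamation is left to the garbage collector, and Dequeue is omitted. The class and method names are hypothetical.

```python
import threading


class SimAtomicRef:
    """Reference cell with an emulated CAS (a lock stands in for the hardware primitive)."""

    def __init__(self, ref=None):
        self._ref = ref
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._ref

    def cas(self, expected, new):
        with self._lock:
            if self._ref is expected:
                self._ref = new
                return True
            return False


class QNode:
    def __init__(self, value=None):
        self.value = value
        self.next = SimAtomicRef(None)


class MSQueue:
    def __init__(self):
        dummy = QNode()
        self.head = SimAtomicRef(dummy)
        self.tail = SimAtomicRef(dummy)

    def enqueue(self, value):
        """Attach `value`; return the number of retries before CAS_A succeeded."""
        node = QNode(value)
        retries = 0
        while True:
            last = self.tail.read()
            successor = last.next.read()          # state check
            if successor is None:                 # steady state
                if last.next.cas(None, node):     # CAS_A: linearization point
                    self.tail.cas(last, node)     # CAS_B: may fail if a helper did it
                    return retries
            else:                                 # transient state
                self.tail.cas(last, successor)    # help: CAS_B for the previous success
            retries += 1

    def values(self):
        """Values from the oldest to the newest element (skipping the dummy node)."""
        out = []
        node = self.head.read().next.read()
        while node is not None:
            out.append(node.value)
            node = node.next.read()
        return out


queue = MSQueue()
queue.enqueue(1)
queue.enqueue(2)
print(queue.values())
```

Each retry in this sketch ends with either a failed CAS_A or a helping CAS_B, which are the two kinds of retries that the equivalent execution pairs into a single larger retry loop.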
The flow of an Enqueue is determined by these state checks. Thus, an Enqueue could execute multiple
CASB (successful or failing) and multiple CASA (failing) in an interleaved manner, before succeeding
in CASA at the end of the last retry. If we assume that both states are equally probable for a check
instant which will then end up with a retry, the number of CASs that end up with a retry is
expected to be distributed equally among CASA and CASB for each thread. In addition, each thread
has a successful CASA (which linearizes the Enqueue) and a CASB at the end of the operation which
could either succeed or fail because of a concurrent helper thread.
We imitate such an execution with an equivalent execution in which threads keep the same relative
ordering of the invocation and return from Enqueue, together with the same result. In the equivalent execution,
threads alternate between CASA and CASB in their retries, and both steps of a successful operation are
conducted by the same thread. The equivalent execution can be obtained by thread-wise reordering
of the CASs that lead to a retry and exchanging successful CASBs with the failed counterparts at
the end of an Enqueue, as the latter ones indeed fail because of the success of helper threads.
The model can be applied to this equivalent execution by attributing each CASA-CASB couple to a
single iteration and representing it as a larger retry loop, since the successful couple cannot overlap
with another successful one and all overlapping ones fail. With a straightforward extension of the
expansion formula, we accommodate the CASA in the critical work, which can also expand, and use
CASB as the CAS of our model.
Figure 15: Enqueue-Dequeue on Michael and Scott queues. [Panels: (a) 4 threads, (b) 6 threads, (c) 8 threads, all with cw = 225. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Curves: Low, High, Average, Constant, Poisson.]
In addition, we take one step further outside the analysis by including a new case, where the
parallel section follows a Poisson distribution instead of being constant: pw is used as the mean
of the Poisson distribution. The results are illustrated in Figure 15.
Our model provides good estimates for the constant pw and also reasonable results for the Poisson
distribution case, although this case deviates from (or extends) our model assumptions. The advantage
of regularity, which brings synchronization to threads, can be observed when the constant and Poisson
distributions are compared. In the Poisson distribution, the threads start to fail with larger pw, which
smooths the curve around the peak of the throughput.
G. Discussion
In this subsection we discuss the adequacy of our model, specifically the cyclic argument, to
capture the behavior that we observe in practice. Figure 16 illustrates the frequency of occurrence of
a given number of consecutive fails, together with average fails per success values and the throughput
values, normalized by a constant factor so that they can be seen on the graph. In the background,
the frequency of occurrence of a given number of consecutive fails before success is presented. As
a remark, the frequency of 6+ fails is gathered with 6. We expect to see a frequency distribution
concentrated around the average fails per success value, within the bounds computed by our model.
Figure 16: Consecutive Fails Frequency. [cw = 4000, threads = 6. x-axis: Parallel Work (cycles). Background shading: Consecutive Fail Frequency. Curves: Av. Fails per Success, Model Average, Normalized Throughput.]
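The statistics shown in Figure 16 can be derived from a per-operation log of the number of fails that preceded each success. The sketch below is only an illustration of that bookkeeping; the function names and the sample log are hypothetical.

```python
from collections import Counter


def consecutive_fail_frequency(fails_before_success, cap=6):
    """Relative frequency of each number of consecutive fails before a success.

    Counts of `cap` or more fails are gathered into the `cap` bucket.
    """
    total = len(fails_before_success)
    if total == 0:
        return {}
    counts = Counter(min(f, cap) for f in fails_before_success)
    return {k: counts.get(k, 0) / total for k in range(cap + 1)}


def average_fails_per_success(fails_before_success):
    """Average number of fails per successful operation."""
    return sum(fails_before_success) / len(fails_before_success)


# Hypothetical log: one entry per successful operation.
log = [0, 0, 1, 2, 7, 9, 1, 0]
print(consecutive_fail_frequency(log))
print(average_fails_per_success(log))
```

In this hypothetical log, the two operations with 7 and 9 fails both land in the bucket labeled 6, while the average (2.5 fails per success) is computed on the uncapped values.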
While comparing the distribution of failures with the throughput, we could conjecture that the
bumps come from the fact that the failures spread out. However, our model correctly captures the
throughput variations and thus isolates the right impacting factor. The spread of the distribution
of failures indicates the violation of a stable cyclic execution (that takes place in our model), but
in these regions, r actually gets close to 0, as well as the minimum of all gaps. The scattering in
failures shows that, during the execution, a thread is overtaken by another one. Still, as gaps are
close to 0, the imaginary execution, in which we switch the two thread IDs, would create almost the
same performance effect. This reasoning is strengthened by the fact that the actual average number
of failures follows the step behavior, predicted by our model. This shows that even when the real
execution is not cyclic and the distribution of failures is not concentrated, our model that results in
a cyclic execution remains a close approximation of the actual execution.
H. Back-Off Tuning
Together with the analysis comes a natural back-off strategy: we estimate the pw corresponding to
the peak point of the average curve, and when the parallel section is smaller than the corresponding
pw, we add a back-off in the parallel section, so that the new parallel section is at the peak point.
We have applied exponential, linear and our back-off strategy to the Enqueue/Dequeue experiment
specified above. Our back-off estimate provides good results for both types of distribution. In
Figure 17 (where the values of back-off are steps of 115 cycles), the comparison is plotted for the
Poisson distribution, which is likely to be the worst for our back-off. Our back-off strategy is better
than the others, except for very small parallel sections, but the other back-off strategies need to be tuned
for each value of pw.
Figure 17: Comparison of back-off schemes for Poisson Distribution. [Panels: (a) 8 threads, (b) 4 threads, both with cw = 225. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Legend: Type (Exponential, Linear, New, None); Value (0, 1, 2, 4, 8, 16, 32).]
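The back-off rule stated above reduces to a small computation once the average curve of the model is available. The sketch below assumes that this curve is given as sampled (pw, predicted throughput) pairs; the sample values and function names are hypothetical and only illustrate the rule.

```python
def peak_parallel_work(average_curve):
    """Return the pw at which the predicted average throughput is highest.

    `average_curve` is a list of (pw_in_cycles, predicted_throughput) pairs.
    """
    return max(average_curve, key=lambda point: point[1])[0]


def backoff_cycles(pw, pw_peak):
    """Back-off added to the parallel section so that it reaches the peak point.

    No back-off is added when the parallel section is already at or past the peak.
    """
    return max(0, pw_peak - pw)


# Hypothetical samples of the average curve (pw in cycles, throughput in ops/msec).
curve = [(0, 3000), (500, 4500), (1000, 6000), (1500, 5800), (2000, 5000)]
pw_peak = peak_parallel_work(curve)
print(pw_peak, backoff_cycles(200, pw_peak), backoff_cycles(1500, pw_peak))
```

Unlike linear or exponential back-off, this rule has no parameter to tune per value of pw: the only input is the peak position estimated by the model.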
We obtained the same shapes while removing the distribution law and considering constant values.
The results are illustrated in Figure 18.
Figure 18: Comparison of back-off schemes for constant pw. [Panels: (a) 8 threads, (b) 4 threads, both with cw = 225. x-axis: Parallel Work (cycles); y-axis: Throughput (ops/msec). Legend: Type (Exponential, Linear, New, None); Value (0, 1, 2, 4, 8, 16, 32).]
VII. Conclusion
In this paper, we have modeled and analyzed the performance of a general class of lock-free
algorithms, and have thus been able to predict the throughput of such algorithms on actual executions.
The analysis relies on the estimation of two impacting factors that lower the throughput: on the
one hand, the expansion, due to the serialization of the atomic primitives that take place in the
retry loops; on the other hand, the wasted retries, due to a non-optimal synchronization between
the running threads. We have derived methods to calculate those parameters, along with the final
throughput estimate, which is calculated from a combination of these two parameters. As
a side result of our work, this accurate prediction enables the design of a back-off technique that
performs better than other well-known techniques, namely linear and exponential back-offs.
As future work, we envision enlarging the domain of validity of the model, in order to cope with
data structures whose operations do not have constant retry loops, as well as the framework, so that
it includes more varied access patterns. The fact that our results extend outside the model allows us
to be optimistic about the identification of the right impacting factors. Finally, we also foresee studying
back-off techniques that would combine a back-off in the parallel section (for lower contention) and
in the retry loops (for higher robustness).
References
[ACS14] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-free concurrent algorithms practically wait-
free? In David B. Shmoys, editor, Symposium on Theory of Computing (STOC), pages 714–723. ACM,
June 2014.
[AF92] Juan Alemany and Edward W. Felten. Performance issues in non-blocking synchronization on shared-
memory multiprocessors. In Norman C. Hutchinson, editor, Proceedings of the ACM Symposium on
Principles of Distributed Computing (PoDC), pages 125–134. ACM, 1992.
[ARJ97] James H. Anderson, Srikanth Ramamurthy, and Kevin Jeffay. Real-time computing with lock-free shared
objects. ACM Transactions on Computer Systems (TOCS), 15(2):134–165, 1997.
[DB08] Kristijan Dragicevic and Daniel Bauer. A survey of concurrent priority queue algorithms. In Proceedings
of the International Parallel and Distributed Processing Symposium (IPDPS), pages 1–6, April 2008.
[DGT13] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about
synchronization but were afraid to ask. In Michael Kaminsky and Mike Dahlin, editors, Proceedings of the
ACM Symposium on Operating Systems Principles (SOSP), pages 33–48. ACM, November 2013.
[DLM13] Dave Dice, Yossi Lev, and Mark Moir. Scalable statistics counters. In Guy E. Blelloch and Berthold Vöcking,
editors, Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages
43–52. ACM, July 2013.
[GH09] James R. Goodman and Herbert Hing Jing Hum. Mesif: A two-hop cache coherency protocol for point-to-
point interconnects. Technical report, University of Auckland, November 2009.
[HSY10] Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. Journal of Parallel
and Distributed Computing (JPDC), 70(1):1–12, 2010.
[Int13] Intel. Lock scaling analysis on Intel® Xeon® processors. Technical Report 328878-001, Intel, April 2013.
[KH14] Alex Kogan and Maurice Herlihy. The future(s) of shared data structures. In Magnús M. Halldórsson and
Shlomi Dolev, editors, Proceedings of the ACM Symposium on Principles of Distributed Computing (PoDC),
pages 30–39. ACM, July 2014.
[LJ13] Jonatan Lindén and Bengt Jonsson. A skiplist-based concurrent priority queue with minimal memory
contention. In Roberto Baldoni, Nicolas Nisse, and Maarten van Steen, editors, Proceedings of the
International Conference on Principles of Distributed Systems (OPODIS), volume 8304 of Lecture Notes
in Computer Science, pages 206–220. Springer, December 2013.
[Mic04] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions
on Parallel and Distributed Systems (TPDS), 15(6):491–504, 2004.
[MS96] Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking and blocking concurrent
queue algorithms. In James E. Burns and Yoram Moses, editors, Proceedings of the ACM Symposium on
Principles of Distributed Computing (PoDC), pages 267–275. ACM, May 1996.
[SL00] Nir Shavit and Itay Lotan. Skiplist-based concurrent priority queues. In Proceedings of the International
Parallel and Distributed Processing Symposium (IPDPS), pages 263–268, May 2000.
[Tre86] R. Kent Treiber. Systems programming: Coping with parallelism. International Business Machines
Incorporated, Thomas J. Watson Research Center, 1986.
[Val94] J. D. Valois. Implementing Lock-Free Queues. In Proceedings of International Conference on Parallel and
Distributed Systems (ICPADS), pages 64–69, December 1994.
