On the analysis of scheduling algorithms for structured parallel
  computations by Rito, Guilherme & Paulino, Hervé
arXiv:1810.10632v1 [cs.DC] 24 Oct 2018
On the analysis of scheduling algorithms for
structured parallel computations
Guilherme Rito∗1 and Hervé Paulino2
1ETH Zurich, guilherme.teixeira@inf.ethz.ch
2NOVA Laboratory for Computer Science and Informatics,
Departamento de Informática, Faculdade de Ciências e Tecnologia,
Universidade NOVA de Lisboa, herve.paulino@fct.unl.pt
October 26, 2018
Abstract
Algorithms for scheduling structured parallel computations have
been widely studied in the literature. For some time now, Work Stealing
has been one of the most popular algorithms for scheduling such
computations, and its performance has been studied in both theory and
practice. Although it delivers provably good performance, the
effectiveness of its underlying load balancing strategy is known to be
limited for certain classes
of computations, particularly the ones exhibiting irregular parallelism
(e.g. depth first searches). Many studies have addressed this limitation
from a purely load balancing perspective, viewing computations as sets
of independent tasks, and then analyzing the expected amount of work
attached to each processor as the execution progresses. However, these
studies make strong assumptions regarding work generation which, de-
spite being standard from a queuing theory perspective — where work
generation can be assumed to follow some random distribution — do
not match the reality of structured parallel computations — where the
work generation is not random, only depending on the structure of a
computation.
In this paper, we introduce a formal framework for studying the
performance of structured computation schedulers, define a criterion
that is appropriate for measuring their performance, and present a
methodology for analyzing the performance of randomized schedulers.
We demonstrate the convenience of this methodology by using it to
prove that the performance of Work Stealing is limited, and to analyze
the performance of a Work Stealing and Spreading algorithm, which
overcomes Work Stealing’s limitation.
∗Work done while author was at NOVA Laboratory for Computer Science and Infor-
matics, Departamento de Informática, Faculdade de Ciências e Tecnologia, Universidade
NOVA de Lisboa.
1 Introduction
The main goal of a structured computation’s scheduler is to guarantee the
fast completion of the execution of arbitrary structured computations. For
some time now, Work Stealing has been one of the most popular algorithms for
scheduling structured computations [ABP98, ABP01, BL99, ACR13, BFG03,
CKK+08, DLS+09, HS02, QW10]. In Work Stealing (or WS for short), each
processor owns a deque that it uses to keep track of its work. Busy processors
operate locally on their deques, adding and retrieving work from them as
necessary, until they run out of work. When that happens, a processor
becomes a thief and starts a stealing phase, during which it targets other
processors, uniformly at random, in order to steal work from their deques.
As proved in [ABP01, BL99], the expected execution time of any compu-
tation using WS is asymptotically optimal. Nevertheless, WS’s performance
is known to be limited for the execution of computations that exhibit ir-
regular parallelism (e.g. depth first computations where only a few threads
actually generate work) [ACR13, BFG03, CKK+08, DLS+09]. To cope
with this limitation, numerous studies have resorted to
steal-half deques [ACR13, DLS+09, HS02, QW10], which allow thieves to
take up to half of the work of their victims in a single steal operation. The
adoption of steal-half strategies by real-life schedulers has been mostly
justified by the strategies' importance in distributed memory environments,
where each steal attempt incurs significant latency, making it worthwhile to
transfer a larger amount of work in a single steal. On the other hand, the
steal-half strategy has been formally proved, from a queuing theory perspec-
tive, to be an effective load balancing method for schedulers of independent
tasks [BFG03]. However, while this strategy may be ideal for independent
task scheduling from a queuing theory perspective — where tasks are as-
sumed to arrive at a system according to some probability distribution, and
work transfers are assumed to take constant time regardless of the amount of
tasks transferred — it remains unknown whether it is suitable for structured
computation scheduling — where work generation only depends on the struc-
ture of a computation, and where the time for a processor to transfer work
from another processor is proportional to the amount of work transferred —
and so the problem of how to cope with WS’s limitation remains open. Even
more importantly, while there are well established methods for analyzing the
performance of the load balancers of independent task schedulers — usually
based on the analysis of Markov chains — and a well-defined goal — which
is typically to assure that the system’s load does not grow unboundedly over
time — to the best of our knowledge there are no well-defined methods suit-
able for analyzing the performance of the load balancers of online structured
computation schedulers, nor even well-defined goals.
To this end, the contributions of this paper are:
• A formal framework for studying the performance of structured
computation schedulers (Section 2). One of the key features of this framework
is that it can be used to model most, if not all, practical scheduling
algorithms.
• The definition of algorithm short-term stability (Section 2.1), which
is an appropriate criterion for measuring the performance of online
structured computation schedulers.
• A methodology that allows one to effectively study the performance of
randomized computation schedulers (Section 3). We demonstrate its
convenience by: 1. using it to prove that the performance of WS is
indeed limited (Section 3.2); and 2. presenting a (purely theoretical)
variant of the WS algorithm where processors attempt to spread work
as it is generated, and then using our methodology to show that the
algorithm overcomes the identified limitations of WS (Section 4). De-
spite being purely theoretical, the algorithm we present gives us insight
on how the limitations of WS can be addressed.
2 A criterion to measure the performance of com-
putation schedulers
As in much previous work [ABB02, ALHH08, ABP01, BL99, MA16, TGT+10],
we model a computation as a dag G = (V,E), where each node v ∈ V cor-
responds to an instruction, and each edge (µ1, µ2) ∈ E denotes an ordering
constraint (meaning µ2 can only be executed after µ1). Nodes with in-degree
of 0 are referred to as roots, while nodes with out-degree of 0 are called sinks.
We make the two standard assumptions regarding the structure of
computations. Let G denote a computation’s dag: 1. there is only one root and
one sink in G; and 2. the out-degree of any node within G is at most two.
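As an illustration of the model, the following minimal Python sketch checks the two structural assumptions on a dag given as node and edge lists. The function name, the input encoding and the example graph are hypothetical and not part of the paper; acyclicity is assumed rather than checked.

```python
from collections import defaultdict


def check_computation_dag(nodes, edges):
    """Return (root, sink) if the dag meets the two structural assumptions.

    nodes: iterable of hashable node ids.
    edges: iterable of (u, v) pairs, meaning v can only execute after u.
    Acyclicity is assumed here, not verified.
    """
    nodes = list(nodes)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1

    roots = [n for n in nodes if in_deg[n] == 0]   # in-degree 0
    sinks = [n for n in nodes if out_deg[n] == 0]  # out-degree 0
    if len(roots) != 1 or len(sinks) != 1:
        raise ValueError("expected exactly one root and one sink")
    if any(out_deg[n] > 2 for n in nodes):
        raise ValueError("out-degree of every node must be at most two")
    return roots[0], sinks[0]


# Hypothetical fork-join example: a forks into b and c, which join at d.
diamond_nodes = ["a", "b", "c", "d"]
diamond_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
root, sink = check_computation_dag(diamond_nodes, diamond_edges)
```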
We consider that processors operate on discrete time steps, each execut-
ing one instruction — that may or may not correspond to a computation
node — per time step. The execution of a computation is carried out by a
set of processors denoted by Procs whose cardinality is denoted by P . We
assume that P ≥ 2 (i.e. Procs is composed of at least two processors), and
that all processors operate synchronously in time steps. Therefore, a com-
putation’s execution can be partitioned into discrete time steps, such that
at each step every processor executes an instruction. We refer to these time
steps using non-negative integers, where 0 is the first step and i + 1 is the
step succeeding i.
Definition 2.1. At any step during a computation’s execution each node
of the computation is in exactly one of the following states: not ready —
if its predecessors have not yet been executed; ready — if its predecessors
have been executed, but not the node itself; and executed — if the node
has been executed.
As one may note, a node can only be ready if all the ordering con-
straints wrt (with respect to) the node are satisfied. For example, at the
first step of a computation’s execution every node (except for the root) is
not ready. To ensure the correct execution of a computation, only nodes
that are not ready can become ready, and only nodes that are ready
can become executed. For each step i, refer to the set of nodes that
are: 1. not ready by $\mathit{NonReady}_i$; 2. ready by $\mathit{Ready}_i$
(or simply $R_i$); and 3. executed by $\mathit{Executed}_i$. Since only the
nodes that are ready can become executed, $\mathit{Executed}_i$ can
alternatively be defined as
$$\mathit{Executed}_i = \Bigl(\bigcup_{j \in \{1,\dots,i-1\}} R_j\Bigr) - R_i$$
(i.e. the set of all nodes that were once ready, but no longer are).
For each step i, partition Ri into P sets (one per processor), and refer to
Ri(p) — processor p’s partition of Ri — as the set of nodes that are attached
to processor p at step i. Say that a node was enabled at step i if it was not
ready at step i but is ready at step i + 1, and, similarly, that a node was
executed at step i if it was ready at step i but is executed at step i+1. In
addition, say that a node µ is migrated if µ ∈ Ri (p) and µ ∈ Ri+1 (q), where
p ≠ q, for p, q ∈ Procs. The next definition formalizes these ideas.
Definition 2.2. For each step i and processor p, define the set of nodes
enabled by p as $E_i(p) = R_{i+1}(p) - R_i$ and computed (or executed) by p as
$C_i(p) = R_i(p) - R_{i+1}$. Moreover, define the set of nodes migrated from p
to all other processors as $M^+_i(p) = R_i(p) \cap (R_{i+1} - R_{i+1}(p))$ and
from all other processors to p as $M^-_i(p) = R_{i+1}(p) \cap (R_i - R_i(p))$.
For a set of processors $S \in \mathcal{P}(\mathit{Procs})$, define
$R_i(S) = \bigcup_{p \in S} R_i(p)$, $E_i(S) = \bigcup_{p \in S} E_i(p)$,
$C_i(S) = \bigcup_{p \in S} C_i(p)$, $M^+_i(S) = \bigcup_{p \in S} M^+_i(p)$, and
$M^-_i(S) = \bigcup_{p \in S} M^-_i(p)$. Additionally, define
$E_i = E_i(\mathit{Procs})$, $C_i = C_i(\mathit{Procs})$,
$M^+_i = M^+_i(\mathit{Procs})$, and $M^-_i = M^-_i(\mathit{Procs})$.
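The set expressions of Definition 2.2 translate directly into set operations. The sketch below is a minimal illustration under assumed inputs: two dictionaries mapping each processor to its attached ready nodes at steps i and i + 1. The function name and the example step are hypothetical.

```python
def step_sets(ready_now, ready_next, p):
    """Compute E_i(p), C_i(p), M+_i(p) and M-_i(p) as in Definition 2.2.

    ready_now / ready_next map each processor q to R_i(q) and R_{i+1}(q).
    """
    all_now = set().union(*ready_now.values())     # R_i
    all_next = set().union(*ready_next.values())   # R_{i+1}
    enabled = ready_next[p] - all_now                          # E_i(p)
    computed = ready_now[p] - all_next                         # C_i(p)
    migrated_out = ready_now[p] & (all_next - ready_next[p])   # M+_i(p)
    migrated_in = ready_next[p] & (all_now - ready_now[p])     # M-_i(p)
    return enabled, computed, migrated_out, migrated_in


# Hypothetical step: p executes "a" (enabling "c") while "b" moves from p to q.
R_i = {"p": {"a", "b"}, "q": set()}
R_next = {"p": {"c"}, "q": {"b"}}
E_p, C_p, M_out_p, M_in_p = step_sets(R_i, R_next, "p")
_, _, _, M_in_q = step_sets(R_i, R_next, "q")
```

In this example the node that leaves p is exactly the node that arrives at q, which is the step-level intuition behind Lemma 2.5.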
Having defined these sets of nodes, we now introduce rounds. Informally,
a round is a sequence of time steps with constant length, such that every
processor executes at most one node and no ready node is migrated more
than once.
Definition 2.3. A round is a sequence of L time steps (for some constant
L ≥ 1) such that a computation’s execution can be partitioned into equal-
length rounds and for every round: 1. no processor executes more than a
single node; and 2. no node is migrated more than once.
Analogously to time steps, refer to rounds using non-negative integers,
but with an additional bar, where 0 denotes the first round. Throughout
this paper, we let L denote the length of rounds and t [i] denote the i-th step
of a round t (for i ∈ {0, . . . , L − 1}). As we will see, the length of rounds
depends on the scheduling algorithm.
Now, we reintroduce the concepts we have already presented above con-
cerning the states of nodes, but this time considering rounds rather than
steps.
Definition 2.4. For each round t and processor p, define the set of nodes
attached to p at round t as $R_t(p) = R_{t[0]}(p)$, enabled by p during t as
$$E_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} E_i(p),$$
executed by p during t as
$$C_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} C_i(p),$$
migrated from p to all other processors during t as
$$M^+_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} M^+_i(p)$$
and migrated from all other processors to p during t as
$$M^-_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} M^-_i(p).$$
For a set of processors $S \in \mathcal{P}(\mathit{Procs})$, define
$R_t(S) = \bigcup_{p \in S} R_t(p)$, $E_t(S) = \bigcup_{p \in S} E_t(p)$,
$C_t(S) = \bigcup_{p \in S} C_t(p)$, $M^+_t(S) = \bigcup_{p \in S} M^+_t(p)$, and
$M^-_t(S) = \bigcup_{p \in S} M^-_t(p)$. Additionally, define
$E_t = E_t(\mathit{Procs})$, $C_t = C_t(\mathit{Procs})$,
$M^+_t = M^+_t(\mathit{Procs})$, and $M^-_t = M^-_t(\mathit{Procs})$.
The following result, Lemma 2.5, states that the set of nodes migrated to
a processor p during a round t is a subset of all the nodes that are migrated
from all processors but p during t. The proof of the following result can be
found in the Appendix (Section A.1).
Lemma 2.5. For any round t and p ∈ Procs,
$M^-_t(p) \subseteq M^+_t(\mathit{Procs} - \{p\})$.
As one may note, the definition of round (Definition 2.3) implies that for
any processor p and round t, $|C_t(p)| \le 1$ and
$M^+_t(p) \cap M^+_t(\mathit{Procs} - \{p\}) = \emptyset$
(i.e. no two processors migrate the same node during the same round). By
Lemma 2.5, it then follows that $M^+_t(p) \cap M^-_t(p) = \emptyset$.
The next lemma is essential for the rest of our analysis, as it shows the
connection between the set of nodes that are attached to each processor p
at some round t, and the set of nodes that are attached to p at round t+ 1.
The proof of this result can be found in the Appendix (Section A.2).
Lemma 2.6 (Round Progression Lemma). For any round t and processor
p ∈ Procs,
$$R_{t+1}(p) = \bigl(E_t(p) \cup R_t(p) \cup M^-_t(p)\bigr) - \bigl(C_t(p) \cup M^+_t(p)\bigr).$$
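The Round Progression Lemma can be read as a one-line set update. The following sketch only illustrates the identity on a hand-made round; the function name, argument names and the example sets are assumptions, not part of the paper.

```python
def next_round_ready(R, E, C, M_in, M_out):
    """R_{t+1}(p) per Lemma 2.6, for a single processor p.

    Arguments are the sets R_t(p), E_t(p), C_t(p), M-_t(p) and M+_t(p).
    """
    return (E | R | M_in) - (C | M_out)


# Hypothetical round: p holds {a, b, c}, executes a (enabling d and e),
# loses b to a thief, and receives nothing from other processors.
R_next = next_round_ready(
    R={"a", "b", "c"}, E={"d", "e"}, C={"a"}, M_in=set(), M_out={"b"}
)
```

Here p starts the next round with c, d and e attached: one node more than the two it had left unexecuted, which is the kind of growth the stability criterion of Section 2.1 is designed to detect.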
2.1 Algorithm short-term stability
We now move to present algorithm short-term stability, the criterion that will
be used to measure the performance of structured computation schedulers.
We begin by stating the following requirement, which gives us the guarantee
that a processor p only executes a node µ during a round t if µ is attached
to p at the beginning of that round.
Requirement 2.7. For any round t and p ∈ Procs, we must have Rt (p) ⊇
Ct (p).
Definition 2.8 (Busy and Idle Processors). Say that a processor p ∈ Procs
is idle during a round t if $C_t(p) = \emptyset$, and, otherwise, say that p
is busy. Moreover, denote the number of idle processors during a round t by
$P^{\mathrm{idle}}_t$, and define $\alpha_t$ as the ratio of idle processors,
$\alpha_t = P^{\mathrm{idle}}_t / P$.
Now, we introduce the notion of short-term stability. Intuitively, a set
of processors S is short-term stable for some round t if the number of nodes
attached to the processors in S that are not executed is not expected to
increase from round t to round t+ 1.
Definition 2.9 (Short-term stability). A set of processors S ∈ P(Procs)
is short-term stable for some round t during a computation’s execution if
E[|Rt+1(S)− Ct+1(S)|] ≤ |Rt(S)− Ct(S)|.
Ideally, we would want to ensure short-term stability for all rounds and
wrt all processors (i.e. S = Procs). However, since a processor can enable
two nodes during one round, a scheduler may only be able to guarantee
short-term stability wrt all processors if at least half of them are idle during
a round. For this reason, we will now move to introduce Algorithm short-term
stability, which is based on the same rationale as short-term stability.
For each round t, we classify processors according to whether they execute
all their attached nodes during t or not. If a processor p executes all its
attached nodes during round t (i.e. if Rt (p) = Ct (p)), then we say that p is
self-stable at round t. Otherwise, we say that p is non-self-stable at round t.
Definition 2.10. Define the set of self-stable and non-self-stable processors
at some round t as St = {p ∈ Procs |Rt (p) = Ct (p)} and Ut = Procs − St,
respectively.
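To make Definitions 2.8 and 2.10 concrete, the sketch below computes the idle ratio and the self-stable and non-self-stable sets for one round. The dictionary encoding, the function name and the four-processor example are hypothetical.

```python
def classify_round(R, C):
    """Return (alpha_t, S_t, U_t) for one round.

    R and C map each processor p to R_t(p) and C_t(p), respectively.
    """
    procs = list(R)
    idle = [p for p in procs if not C[p]]             # C_t(p) is empty
    alpha = len(idle) / len(procs)                    # ratio of idle processors
    self_stable = {p for p in procs if R[p] == C[p]}  # S_t
    non_self_stable = set(procs) - self_stable        # U_t
    return alpha, self_stable, non_self_stable


# Hypothetical round with four processors, two of which have no work.
R_t = {"p0": {"a", "b"}, "p1": {"c"}, "p2": set(), "p3": set()}
C_t = {"p0": {"a"}, "p1": {"c"}, "p2": set(), "p3": set()}
alpha_t, S_t, U_t = classify_round(R_t, C_t)
```

Note that a processor with no attached nodes is idle and also self-stable, since both of its sets are empty.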
Having this, we can finally define Algorithm short-term stability, our cri-
terion for measuring computation schedulers’ performance. Informally, the
main idea is that if the ratio of idle processors at some round t is sufficiently
high, then the amount of work attached to non-self-stable processors is
expected to decrease, and, at the same time, the amount of work attached to
self-stable processors does not grow unboundedly.
Definition 2.11 (Algorithm short-term stability). A scheduling algorithm
is algorithm short-term stable with respect to an interval
$I \subseteq\, ]0; 1[$, iff (if and only if) for any round t,
$$(\alpha_t \in I) \Rightarrow \Bigl[ \bigl( \mathrm{E}[|R_{t+1}(U_t) - C_{t+1}(U_t)|] < |R_t(U_t) - C_t(U_t)| \bigr) \land \bigl( \forall p \in S_t,\ |R_{t+1}(p)| \le L + 1 \bigr) \Bigr],$$
where L denotes the length of the rounds.
Note that, in contrast to short-term stability, algorithm short-term
stability requires that the expected number of nodes attached to processors
of Ut that are not executed strictly decreases from one round to the next. In
addition, by limiting the number of nodes that can become attached to a
self-stable processor during a round, we prevent scheduling algorithms from
ping-ponging work between non-self-stable and self-stable processors
throughout the execution. The insight behind bounding the number of nodes by
the length of rounds is that we force processors to accept
each node they are given.
Intuitively, if an algorithm is algorithm short-term stable wrt some non-
empty interval I, then the algorithm’s load balancer is sufficiently effective to
guarantee that, regardless of the computation it is scheduling, for any round t
such that αt ∈ I, its performance is expected to be good (or, in other words,
work accumulation is not expected). On the other hand, if an algorithm
cannot guarantee algorithm short-term stability wrt any non-empty interval
I for any round during the execution of an arbitrary computation, then
the effectiveness of its load balancer is limited, and thus may lead to work
accumulation depending on the structure of the computation being executed.
As one may note, the definition of Algorithm short-term stability relies on
the overall behavior of the set of processors Ut, for each round t. However, it
is much simpler to reason about the behavior of each processor p ∈ Ut, than
it is to reason about the behavior of all processors of Ut. The next result is
then crucial for the rest of our analysis, as it shows the relation between the
behavior of all the processors of a set of processors S and the behavior of
each individual processor p of S.
Lemma 2.12. For any round t and $S \in \mathcal{P}(\mathit{Procs})$, if
$$\forall p \in S,\ \mathrm{E}[|R_{t+1}(p) - C_{t+1}(p)|] < |R_t(p) - C_t(p)|$$
then
$$\mathrm{E}[|R_{t+1}(S) - C_{t+1}(S)|] < |R_t(S) - C_t(S)|.$$
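Proof. First, recall that $R_t$ is partitioned through the processors. By
Requirement 2.7, $C_t$ is also partitioned through the processors and
$\forall p \in \mathit{Procs}$, $R_t(p) \supseteq C_t(p)$. Thus,
$$\sum_{p \in S} |R_t(p) - C_t(p)| = |R_t(S) - C_t(S)|.$$
To conclude the proof, note that by the linearity of expectation,
$$\sum_{p \in S} \mathrm{E}[|R_{t+1}(p) - C_{t+1}(p)|] = \mathrm{E}[|R_{t+1}(S) - C_{t+1}(S)|].$$
Summing the hypothesis over all p ∈ S and applying these two equalities
yields the claim. □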
The following result is the basis for the analysis of schedulers, as it
relates, for each round t, the difference in the number of nodes that a
processor p enables during t, that are migrated from p during t, and that p
executes during t+ 1, with a result that is closely related to algorithm
short-term stability (the corresponding proof can be found in the Appendix,
Section A.3).
Lemma 2.13. For any round t and processor p ∈ Procs,
$$|E_t(p)| < |C_{t+1}(p)| + |M^+_t(p)|$$
iff
$$|R_{t+1}(p) - C_{t+1}(p)| < |R_t(p) - C_t(p)| + |M^-_t(p)|.$$
The following Corollary then follows from Lemma 2.13.
Corollary 2.14. For any round t, if
$(p \in U_t) \Rightarrow (M^-_t(p) = \emptyset)$, then
$$|E_t(p)| < |C_{t+1}(p)| + |M^+_t(p)|$$
iff
$$|R_{t+1}(p) - C_{t+1}(p)| < |R_t(p) - C_t(p)|.$$
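The counting behind Lemma 2.13 can be sketched as follows. This is only an informal reconstruction, not the proof given in Section A.3. It assumes that $E_t(p)$, $R_t(p)$ and $M^-_t(p)$ are pairwise disjoint, that $C_t(p)$ and $M^+_t(p)$ are disjoint and contained in their union, and that Requirement 2.7 holds at rounds t and t + 1.

```latex
\begin{align*}
|R_{t+1}(p)| &= |E_t(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)| - |M^+_t(p)|
  && \text{(Lemma 2.6)} \\
|R_{t+1}(p) - C_{t+1}(p)| &= |R_{t+1}(p)| - |C_{t+1}(p)|
  && \text{(Requirement 2.7)} \\
 &= |R_t(p) - C_t(p)| + |M^-_t(p)|
    + \bigl(|E_t(p)| - |C_{t+1}(p)| - |M^+_t(p)|\bigr)
\end{align*}
```

Under these assumptions, the left-hand side is strictly smaller than $|R_t(p) - C_t(p)| + |M^-_t(p)|$ exactly when $|E_t(p)| < |C_{t+1}(p)| + |M^+_t(p)|$, and taking $M^-_t(p) = \emptyset$ gives the form stated in Corollary 2.14.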
3 A method to analyze randomized schedulers
In order to analyze the performance of randomized scheduling algorithms,
we introduce a few additional definitions and make some assumptions that
are necessary to permit ordering the actions that processors take during
the execution of computations, and, in particular, during each round. The
reason for the need to order the actions of processors will become apparent
as we use it to analyze the WS algorithm. To aid the reader, as we present
the extra definitions and assumptions that our methodology requires, we use
a WS algorithm (depicted in Algorithm 1) to instantiate them and explain
their meaning.
The WS algorithm we analyze is a synchronous but behaviorally equiva-
lent variant of the original non-blocking algorithm given in [ABP01]. Thus,
each processor owns a lock-free deque object that supports three methods:
pushBottom, popBottom and popTop. Only the owner of a deque may invoke
the pushBottom and popBottom methods, which, respectively, add a node
to the bottom of the deque, and remove and return the bottommost node of
the deque, if any. The popTop method is invoked by processors searching for
work, and for each invocation to this method, the deque’s current topmost
node is guaranteed to be removed and returned, either by such invocation
or by some concurrent one¹. In addition to the deque, each processor has a
variable assigned that stores the node that it will execute next, if any.
3.1 The methodology
First of all, we require that the scheduling algorithm to be analyzed must be
defined by a cycle such that:
1. at most one of the instructions composing any particular iteration of
this cycle may correspond to a node’s execution;
2. no node that is migrated to a processor p, who is executing an iteration
of this cycle, can be migrated again (to another processor), before p
finishes the current iteration;
3. the length of any sequence of instructions that corresponds to some
execution of this cycle is at most constant; and
4. the full sequence of instructions executed by any processor can be par-
titioned into smaller sub-sequences, each corresponding to a particular
execution of this cycle.
Refer to this cycle as the scheduling loop, and to any sequence of instructions
that corresponds to some iteration of a scheduling loop as a scheduling
iteration. As can be observed in Algorithm 1, the definition of the WS algorithm
naturally fits into scheduling loops (corresponding to lines 2 to 19): 1. at
most one of the instructions within the sequence of a scheduling iteration
corresponds to the execution of a node (line 4); 2. no node that is migrated
to a processor is migrated ever again, as it becomes the processor’s new
assigned node (line 23); 3. the length of any iteration of the scheduling loop
is bounded by a constant; and 4. the full sequence of instructions executed
by any processor can be partitioned into scheduling iterations.
To order the actions that processors take during scheduling iterations,
each iteration can be partitioned into a sequence of phases. In particular, for
WS, each iteration is partitioned into two phases:
¹ For a more careful description of the lock-free deque semantics, originally
defined in [ABP01], please refer to Section B.
Algorithm 1 The synchronous WS algorithm
 1: procedure Scheduler
 2:   while not finished(computation) do
 3:     if ValidNode(assigned) then
 4:       enabled ← execute(assigned)
 5:       assigned ← none
 6:       synch(max_phaseI_length, ι())
 7:       if length(enabled) > 0 then
 8:         assigned ← enabled[0]
 9:         if length(enabled) = 2 then
10:           self.deque.pushBottom(enabled[1])
11:         end if
12:       else
13:         assigned ← self.deque.popBottom()
14:       end if
15:     else
16:       self.WorkMigration()
17:     end if
18:     synch(max_phaseII_length, ι())
19:   end while
20: end procedure
21: procedure WorkMigration
22:   victim ← UniformlyRandomProcessor()
23:   assigned ← victim.deque.popTop()
24:   synch(max_phaseI_length, ι())
25: end procedure
26: function ValidNode(node)
27:   return node ≠ empty
28:     and node ≠ abort
29:     and node ≠ none
30: end function
Phase I If a processor has a valid assigned node, it executes the node. Oth-
erwise, it makes a steal attempt, and if the attempt succeeds the stolen node
becomes the processor’s new assigned node.
Phase II If a processor made a steal attempt in the previous phase, it takes
no action during this phase. Otherwise, the processor executed a node in
the previous phase, which enabled either 0, 1 or 2 nodes. If no node was
enabled, the processor invokes popBottom to fetch the bottommost node
from its deque, if such a node exists. If at least one node was enabled, one
of the enabled nodes becomes the processor’s new assigned node, whilst the
other node, if any, is pushed by the processor into the bottom of its own
deque, via the pushBottom method.
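The two phases can be mimicked by a small sequential simulation. The sketch below is an illustration under simplifying assumptions, not the paper's algorithm: thieves are served one after another (so concurrent popTop calls never abort), the synch padding is omitted, and the computation is a hypothetical binary tree in which executing node k enables nodes 2k + 1 and 2k + 2. The names Proc, ws_round, children and run are all invented for this example.

```python
import random
from collections import deque


class Proc:
    """Per-processor state: assigned node and deque (left = top, right = bottom)."""

    def __init__(self):
        self.assigned = None
        self.deque = deque()


def ws_round(procs, children, rng):
    """Run one round (Phase I, then Phase II) of the simplified WS sketch.

    children(node) returns the nodes enabled by executing node (0, 1 or 2).
    """
    enabled = {}
    thieves = []
    # Phase I: processors with an assigned node execute it; the rest steal.
    for i, p in enumerate(procs):
        if p.assigned is not None:
            enabled[i] = children(p.assigned)
            p.assigned = None
        else:
            thieves.append(i)
    for i in thieves:
        victim = procs[rng.randrange(len(procs))]
        if victim.deque:
            procs[i].assigned = victim.deque.popleft()   # popTop
    # Phase II: processors that executed a node choose their next one.
    for i, kids in enabled.items():
        p = procs[i]
        if kids:
            p.assigned = kids[0]
            if len(kids) == 2:
                p.deque.append(kids[1])                  # pushBottom
        elif p.deque:
            p.assigned = p.deque.pop()                   # popBottom


def run(num_procs, num_nodes, seed=0):
    """Execute a binary tree with num_nodes nodes; return (executed, rounds)."""
    rng = random.Random(seed)
    procs = [Proc() for _ in range(num_procs)]
    procs[0].assigned = 0          # the root starts on processor 0
    executed = []

    def children(node):
        executed.append(node)
        return [c for c in (2 * node + 1, 2 * node + 2) if c < num_nodes]

    rounds = 0
    while any(p.assigned is not None or p.deque for p in procs):
        ws_round(procs, children, rng)
        rounds += 1
    return executed, rounds


executed, rounds = run(num_procs=4, num_nodes=15, seed=1)
```

Because steals happen before the pushes of Phase II, a thief in this sketch can only take nodes that were already in a deque when the round began, which matches the ordering of actions described above.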
At this point, we have already ordered the actions that each processor
takes during the execution of every iteration. However, this ordering by itself
does not meet our needs, as we have to guarantee that all processors start
the execution of each phase of every scheduling iteration at the same time.
Our first step towards that goal is to require all processors to begin working
at the exact same time. Refer to the step at which a processor p executes
its i-th instruction as χ (p, i).
Requirement 3.1. ∀p ∈ Procs, χ (p, 1) = 0.
Now, we present the synch procedure, which allows processors to be
synchronized at the end of each phase. The synch procedure takes two input
parameters: 1. maxPhaseLength — the length of a longest sequence of in-
structions that may compose a given phase, and; 2. currentPhaseLength
— the number of time steps during which the processor has been executing
the current phase, until the procedure’s invocation. Given these parameters,
synch adds a sequence of maxPhaseLength − currentPhaseLength no-op
instructions, guaranteeing that the number of steps taken from the begin-
ning of each phase’s execution until the end of its call is the same for all
processors. To use synch, we rely on the purely theoretical procedure ι to
obtain the value of currentPhaseLength.
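The padding performed by synch amounts to a single subtraction. The following sketch returns the number of no-op steps instead of executing them; the Python signature and the error check are assumptions made for illustration.

```python
def synch(max_phase_length, current_phase_length):
    """Return how many no-op steps pad a phase to its maximum length."""
    if current_phase_length > max_phase_length:
        raise ValueError("phase ran longer than its declared maximum")
    return max_phase_length - current_phase_length


# Hypothetical phase of maximum length 5: one processor used 2 steps, another 5.
used_steps = (2, 5)
padding = [synch(5, used) for used in used_steps]
finish_steps = [used + pad for used, pad in zip(used_steps, padding)]
```

Both processors finish the phase after the same number of steps, which is the property the methodology relies on.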
Lastly, we partition a computation’s execution even further, by
partitioning all rounds equally into sequences of stages. To formalize this
idea, define a stage partition $\ddot{s} \in \mathbb{N} \times \mathbb{N}$ as
$\ddot{s} = (\mathit{base}, \mathit{offset})$, with $\mathit{offset} > 0$,
where base and offset are, respectively, the starting step and length of the
stage defined by $\ddot{s}$ within each round. Refer to the i-th stage of a
round t as $t\langle i \rangle$.
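Definition 3.2. Let L be the length of the rounds. Say that a set $\ddot{S}$
is a set of stage partitions if
$L = \sum_{\ddot{s} \in \ddot{S}} \pi_2(\ddot{s})$ and
$$\forall \ddot{s} \in \ddot{S}\ \bigl( [\exists \ddot{r} \in \ddot{S} : \pi_1(\ddot{s}) + \pi_2(\ddot{s}) = \pi_1(\ddot{r})] \lor [\pi_1(\ddot{s}) + \pi_2(\ddot{s}) = L] \bigr),$$
where $\pi_i(t)$ denotes the projection of the i-th element of tuple t.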
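The two conditions of Definition 3.2 can be checked mechanically. The sketch below encodes each stage partition as a (base, offset) pair; the function name and the example sets are hypothetical.

```python
def is_set_of_stage_partitions(stages, round_length):
    """Check the conditions of Definition 3.2 for a set of (base, offset) pairs."""
    if any(offset <= 0 for _, offset in stages):
        return False                      # every stage needs offset > 0
    bases = {base for base, _ in stages}
    total = sum(offset for _, offset in stages)
    # Each stage must end where another stage begins, or at the round's end.
    ends_ok = all(
        base + offset in bases or base + offset == round_length
        for base, offset in stages
    )
    return total == round_length and ends_ok


# Hypothetical rounds of length 5 split into a 3-step and a 2-step stage.
two_stage_ok = is_set_of_stage_partitions({(0, 3), (3, 2)}, 5)
# A gap between the stages violates the chaining condition.
gap_rejected = is_set_of_stage_partitions({(0, 2), (3, 3)}, 5)
```

For WS, the natural instance would use two stages whose offsets are max_phaseI_length and max_phaseII_length.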
Remark 3.3. To analyze a scheduler’s performance using our methodology
it suffices to: 1. define the scheduler by a scheduling loop; 2. divide the
actions that processors take during each iteration of the loop (by partition-
ing each scheduling iteration into phases); and 3. insert a call to the synch
procedure at the end of each phase.
Justification. By using the synch and ι procedures, one can guarantee that
any scheduling algorithm, that may be defined by a scheduling loop, can be
modified so that processors are kept synchronized throughout any computa-
tion’s execution, having that all processors begin the execution of the i-th
phase of the n-th scheduling iteration at the exact same step. With this, the
length of each round can be set to
$\sum_{i \in \mathit{Phases}} \mathit{length}_i$, where Phases denotes
the set of phases that compose a scheduling iteration and $\mathit{length}_i$
denotes the length of the i-th phase². Note that, since all processors execute each
scheduling iteration synchronized, the definition of scheduling loop ensures
us that the requirements of the definition of round are satisfied: 1. each
round has constant length; 2. a computation’s execution can be partitioned
into a sequence of equal-length rounds; 3. during each round no processor
executes more than a single node; and 4. no node is migrated more than
once during a round. Then, it only remains to partition each round into a
sequence of stages, having one stage per phase, and ensuring that the execu-
tion of the i-th phase of a scheduling iteration coincides with the i-th stage
of the corresponding round. □
In the synchronous WS scheduler, depicted in Algorithm 1,
max_phaseI_length and max_phaseII_length are two constants that cor-
respond to the lengths of the longest sequences of instructions composing the
first and second phases of WS, respectively. Thus, by Remark 3.3, we can set
the length of WS’s rounds to max_phaseI_length+max_phaseII_length,
and partition each such round into two stages whose length matches the
maximum length of the corresponding phase. To proceed to the analysis of WS’s
performance, it only remains to show that WS satisfies Requirement 2.7. For
WS, say that a node µ is attached to a processor p if one of the following
conditions holds: 1. µ is p’s currently assigned node; 2. µ is stored in p’s
deque; or 3. µ is stored in enabled[0] or enabled[1] (see line 4 of
Algorithm 1). At the beginning of any round, each node that is attached to a
processor is either in its deque or is the processor’s currently assigned
node. As can be observed in Algorithm 1, each processor only executes the
node that is stored in its assigned variable. Since the value of this
variable does not change from the beginning of the round until the processor
executes the node, the node was already stored in the assigned variable when
the round began, and so the requirement is satisfied.
3.2 Work Stealing’s performance
To show that the synchronous WS algorithm (as defined in Algorithm 1)
is not Algorithm short-term stable, we will create a computation for which
work tends to accumulate unboundedly in some busy processors’ deques.
Before moving to the actual proof, however, we have to make an additional
definition.
Definition 3.4. Refer to the set of nodes stolen at step i from a processor p
as $\mathit{Stolen}^+_i(p)$, and to the set of nodes stolen by p as
$\mathit{Stolen}^-_i(p)$. Moreover, for some round t, define the set of nodes
stolen during t from p as
$$\mathit{Stolen}^+_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} \mathit{Stolen}^+_i(p)$$
and the set of nodes stolen by p as
$$\mathit{Stolen}^-_t(p) = \bigcup_{i \in \{t[0],\dots,t[L-1]\}} \mathit{Stolen}^-_i(p).$$
² Note that, by including the call to the synch procedure at the end of each
phase, we ensure that the i-th phase of every scheduling iteration has the
same length.
Lemma 3.5. For WS, at any round t,
$M^+_t(p) = \mathit{Stolen}^+_t(p)$ and $M^-_t(p) = \mathit{Stolen}^-_t(p)$.
Proof. Both results follow from Definition 3.4 and the specification of
Algorithm 1. □
We now move to obtain both lower and upper bounds on the expected
number of nodes that are stolen from a non-self-stable processor during a
round.
Lemma 3.6. Suppose there are B bins and $B\alpha$ balls, and that each ball is
tossed independently and uniformly at random into the bins. For a bin $b_i$,
let $Y_i$ be an indicator variable, defined as
$$Y_i = \begin{cases} 1 & \text{if at least one ball lands in } b_i; \\ 0 & \text{otherwise.} \end{cases}$$
Then, $\mathrm{E}[Y_i] = P\{Y_i = 1\} \ge 1 - e^{-\alpha}$.
Proof. The probability that no ball lands in $b_i$ is
$P\{Y_i = 0\} = \bigl(1 - \frac{1}{B}\bigr)^{B\alpha} \le e^{-\alpha}$.
To conclude, $\mathrm{E}[Y_i] = P\{Y_i = 1\} \ge 1 - e^{-\alpha}$. □
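The bound of Lemma 3.6 can be compared with the exact probability for concrete sizes. This is a small numerical sketch; the function names and the chosen values of B are arbitrary, and α is fixed to 1/2 so that Bα is an integer.

```python
import math


def prob_bin_hit(num_bins, num_balls):
    """Exact probability that a fixed bin receives at least one ball."""
    return 1.0 - (1.0 - 1.0 / num_bins) ** num_balls


def lemma_bound(alpha):
    """Lower bound 1 - e^(-alpha) from Lemma 3.6."""
    return 1.0 - math.exp(-alpha)


# Hypothetical sizes: B bins and B * alpha balls with alpha = 1/2.
rows = [(B, prob_bin_hit(B, B // 2), lemma_bound(0.5)) for B in (4, 16, 64, 256)]
```

For B = 4 the exact probability is 0.4375, and in every row it stays at or above the bound, approaching it from above as B grows.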
Lemma 3.7. For any round t and $p \in U_t$ during a computation’s execution
using WS, we have
$1 - e^{-\alpha_t} \le \mathrm{E}[|\mathit{Stolen}^+_t(p)|] \le \alpha_t$.
Proof. By observing Algorithm 1, it follows that a processor makes a steal
attempt iff it is idle, implying that exactly $P\alpha_t$ steal attempts are
made during round t. Note that: 1. steal attempts are independent from one
another; and 2. a steal attempt corresponds to targeting a processor
uniformly at random and then invoking the popTop method on its deque. If we
imagine that each steal attempt is a ball toss and that each processor’s
deque is a bin, it follows by Lemma 3.6 that the probability of p’s deque
being targeted is at least $1 - e^{-\alpha_t}$. On the other hand, the
expected number of invocations to the popTop method of any processor p’s
deque is $(P\alpha_t)/P = \alpha_t$. Since p may only invoke the popBottom
method of its deque during the second phase and all the steal attempts take
place during the first phase, then, taking into account the deque semantics
(see Section B): 1. if p’s deque is targeted by at least one steal attempt,
then at least one node is stolen; and 2. at most one node might be returned
for each invocation to the popTop method. Thus,
$\mathrm{E}[|\mathit{Stolen}^+_t(p)|] \ge 1 - e^{-\alpha_t}$ and
$\mathrm{E}[|\mathit{Stolen}^+_t(p)|] \le \alpha_t$. □
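The two quantities that sandwich the expectation in Lemma 3.7 can be estimated by simulating the victim choices alone. The sketch below is a seeded Monte Carlo experiment under the lemma's balls-and-bins view; the processor counts, the number of trials and the function name are arbitrary choices, and no deque contents are modelled.

```python
import random


def simulate_targeting(num_procs, num_idle, trials, seed=0):
    """Estimate how often, and how many times, processor 0's deque is targeted.

    Each of the num_idle thieves picks a victim uniformly at random among all
    num_procs processors, once per simulated round.
    """
    rng = random.Random(seed)
    targeted_rounds = 0
    total_attempts = 0
    for _ in range(trials):
        hits = sum(1 for _ in range(num_idle) if rng.randrange(num_procs) == 0)
        total_attempts += hits
        targeted_rounds += hits > 0
    return targeted_rounds / trials, total_attempts / trials


# Hypothetical system: 16 processors, 8 of them idle, so alpha_t = 1/2.
alpha = 8 / 16
frac_targeted, mean_attempts = simulate_targeting(16, 8, trials=20000, seed=1)
```

The fraction of rounds in which the deque is targeted corresponds to the lower-bound argument, while the mean number of popTop invocations should be close to α_t, the upper bound.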
Lemma 3.8. For any round t, and $p \in U_t$ we have
$1 - e^{-\alpha_t} \le \mathrm{E}[|M^+_t(p)|] \le \alpha_t$, where $\alpha_t$
is the ratio of idle processors.
Proof. Lemmas 3.5 and 3.7 imply this result. □
The next lemma follows from the behavior of the WS algorithm.
Lemma 3.9. Consider some processor p ∈ Procs and some round t during
the execution of a computation by WS. If $p \in U_t$ then p’s deque is
non-empty and $M^-_t(p) = \emptyset$. If $p \in S_t$ then
$|R_{t+1}(p)| \le 2$.
Proof. By the definition of Algorithm 1 it can be proved by induction on
the progression of a computation’s execution that if a processor has at least
one attached node at the beginning of round t, then the processor executes
a node during t. From that, and by observing the algorithm, it follows that
if p has at least one node attached, then it does not make any steal attempt
during t, implying $\mathit{Stolen}^-_t(p) = \emptyset$. Lemma 3.5 then
implies $M^-_t(p) = \emptyset$. On the other hand, since p always executes
one of its attached nodes if there is any, it follows that if $p \in U_t$
then p’s deque is not empty.
If p only has a single attached node, then $p \in S_t$. Because it has one
attached node, it follows that $\mathit{Stolen}^-_t(p) = \emptyset$. Again,
Lemma 3.5 then implies $M^-_t(p) = \emptyset$. In addition, since the
out-degree of any node is at most two (by our conventions regarding
computations’ structure), then at the end of the round p has at most two
attached nodes.
Finally we show that if $R_t(p) = \emptyset$, then at the end of the round p
has at most one attached node. If p has no attached node, then its assigned
variable does not contain a valid node, implying that p executes a call to
the WorkMigration procedure. Since each call only entails one invocation
to the popTop method, then, taking into account the method’s semantics³,
it follows that p may get at most one node from its steal attempt.
Since after performing such an attempt, p takes no further action during the
scheduling iteration other than simply waiting for it to end, we conclude the
lemma holds. □
From Lemma 3.9 and the definition of round (Definition 2.3), it follows that
$\forall p \in S_t$, $|R_{t+1}(p)| \le 2 \le L + 1$, where $L$ is the length of the rounds, which
is at least 1. Thus, if we were to show that WS is algorithm short-term
stable wrt some interval $I \subseteq\ ]0; 1[$, then, by Corollary 2.14, we would only
have to prove that for any round $t$ such that $\alpha_t \in I$, we had
$|E_t(p)| < E[|C_{t+1}(p)| + |M^+_t(p)|]$. Unfortunately, as we now prove, there is no
non-empty interval $I$ wrt which WS is algorithm short-term stable.
³ More details in the appendix, Section B.
Theorem 3.10. There is no non-empty interval I ⊆ ]0; 1[ such that WS (as
defined in Algorithm 1) is algorithm short-term stable wrt I.
Proof. Due to our conventions regarding the computations’ structure, it
follows that during a round a processor can enable two nodes. For some
round $t$, let $p$ be a non-self-stable processor (i.e. $p \in U_t$) such that $|E_t(p)| = 2$.
Lemmas 2.6 and 3.9 imply $R_{t+1}(p) = (E_t(p) \cup R_t(p)) - (C_t(p) \cup M^+_t(p))$. As
already noted in the proof of Lemma 3.9, for the WS algorithm, since $p \in U_t$,
it follows that $|C_t(p)| = 1$. Since we have 1. $E_t(p) \cup R_t(p) \supseteq C_t(p) \cup M^+_t(p)$;
2. $E_t(p) \cap R_t(p) = \emptyset$; and 3. $C_t(p) \cap M^+_t(p) = \emptyset$⁴, then
$|R_{t+1}(p)| = |E_t(p)| + |R_t(p)| - |C_t(p)| - |M^+_t(p)|$. By Lemma 3.8 and since $p$ enabled two
nodes, it follows $E[|R_{t+1}(p)|] \ge 2 + |R_t(p)| - 1 - \alpha_t = |R_t(p)| + 1 - \alpha_t$. Since
$p$ enabled two nodes, it executes a node during the next round, implying
$|C_{t+1}(p)| = 1$. It then follows that
$E[|R_{t+1}(p) - C_{t+1}(p)|] \ge |R_t(p)| + 1 - \alpha_t - 1 = |R_t(p)| - \alpha_t$. Even though the
definition of algorithm short-term stability only considers ratios of idle processors
in $]0; 1[$, note that for WS, if all processors are idle, then the computation’s execution
must have already finished, and so it only makes sense to analyze rounds during which the
execution is still ongoing. It then follows that $\alpha_t < 1$, implying
$E[|R_{t+1}(p) - C_{t+1}(p)|] \ge |R_t(p)| - \alpha_t > |R_t(p) - C_t(p)|$. Thus, if during round $t$, $p$
were the only non-self-stable processor (i.e. $U_t = \{p\}$), then
$E[|R_{t+1}(U_t) - C_{t+1}(U_t)|] > |R_t(U_t) - C_t(U_t)|$. □
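As a concrete instance of this chain of inequalities, take the hypothetical values $|R_t(p)| = 3$ and $\alpha_t = 1/2$ (illustrative numbers only, not taken from the paper). The sketch below additionally assumes that the node executed in round $t$ belongs to $R_t(p)$, so that $|R_t(p) - C_t(p)| = |R_t(p)| - 1$.

```latex
\begin{align*}
E\bigl[|R_{t+1}(p)|\bigr] &\ge 2 + 3 - 1 - \tfrac{1}{2} = \tfrac{7}{2},\\
E\bigl[|R_{t+1}(p) - C_{t+1}(p)|\bigr] &\ge \tfrac{7}{2} - 1 = \tfrac{5}{2}
  > 2 = |R_t(p) - C_t(p)|.
\end{align*}
```

Under these assumptions the expected number of pending nodes attached to $p$ grows from one round to the next, which is exactly what short-term stability forbids.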
As one might note, this implies that WS cannot even guarantee short-
term stability for the set {p} (recall Definition 2.9), regardless of the ratio of
idle processors.
⁴ For a formal proof of these claims see Claims A.13, A.10 and A.12 for parts 1, 2 and
3, respectively, with $M^-_t(p) = \emptyset$.
4 A greedy Work Stealing and Spreading algorithm
In this section we present and analyze the performance of a (purely theoretical)
greedy Work Stealing and Spreading scheduler, or simply WSS. This
algorithm (depicted in Algorithm 2) is a variant of WS where processors load
balance not only by stealing work, but also by spreading it. As in WS, each
processor owns a lock-free deque (obeying the semantics defined in [ABP01])
and a variable assigned that stores the node that it will execute next, if any.
To implement the spreading mechanism, each processor additionally owns a
state flag and a donation cell. Processors use the state flag to inform other
processors of their current state (working, idle, or marked as target of a
donation; more on this ahead) and use the donation cell to store nodes
that they want to spread. In WSS, processors are uniquely identified by an id,
with which they can be accessed in constant time. The scheduler also makes
use of the CAS instruction (Compare-And-Swap), with its usual semantics.
Thus, at most one CAS instruction targeting the same memory location can
successfully execute at each step. We assume that the processor that suc-
ceeds executing the CAS instruction over a memory address m at some step
i is chosen uniformly at random from the set of processors that are eligible
to successfully execute the instruction at step i over memory address m.
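The arbitration assumption on concurrent CAS instructions can be sketched as follows. This is only an illustrative model of one step on one memory location, not code from the paper; `cas_step`, `memory` and `attempts` are hypothetical names introduced here.

```python
import random
from typing import Dict, Hashable, Optional, Tuple

def cas_step(memory: Dict[str, Hashable],
             address: str,
             attempts: Dict[int, Tuple[Hashable, Hashable]],
             rng: random.Random) -> Optional[int]:
    """Resolve all CAS attempts made on one address during one step.

    attempts maps a processor id to (expected, new). Only processors whose
    expected value matches the current content are eligible. One eligible
    processor is picked uniformly at random and its new value is written.
    Returns the winner's id, or None if no processor was eligible.
    """
    current = memory[address]
    eligible = sorted(pid for pid, (expected, _) in attempts.items()
                      if expected == current)
    if not eligible:
        return None
    winner = rng.choice(eligible)          # uniform choice among eligible
    memory[address] = attempts[winner][1]  # at most one CAS succeeds
    return winner
```

In this model two donors that both try to mark the same idle processor compete for a single success, and a processor whose expected value is stale can never win.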
Unlike in WS, we partition each scheduling iteration of WSS into
three phases. Phase I of WSS is very similar to its WS counterpart,
differing only in that WSS processors keep updating their state flags to reflect
their current state. Phases II and III of WSS are as follows:
Phase II If, in phase I, a processor p made a steal attempt or executed a
node that did not enable any node, then p does not take any action during
this phase. Otherwise, if at least one node was enabled, one of the enabled
nodes becomes p’s new assigned node. If two nodes were enabled, then, after
having a new node assigned, p attempts to spread the node it did not assign.
Phase III If a processor p executed a node in phase I but no node was
enabled, p invokes popBottom to fetch the bottommost node from its deque,
if there is any. On the other hand, if a single node was enabled, p does
not take any action during this phase. If two nodes were enabled, p only
takes action if the donation attempt it made during phase II failed. In
that case, p pushes the node it failed to donate onto the bottom of its
own deque, via the pushBottom method. Finally, if the processor made an
unsuccessful steal attempt during the first phase, it polls its state flag to
check for incoming donations. If there is a donation, p transfers the node
from the donor’s donation cell and updates its state flag accordingly.
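The case analysis of Phases II and III can be summarized as a small decision function. This is a sketch that follows the prose description above; the action labels and the function name are hypothetical, and the case of a successful steal (for which no phase III action is described) is assumed to require none.

```python
from typing import Tuple

def phase_actions(executed: bool,
                  enabled: int = 0,
                  donation_succeeded: bool = False,
                  steal_succeeded: bool = False) -> Tuple[str, str]:
    """Return (phase II action, phase III action) for one processor.

    executed: True if the processor executed a node in phase I,
              False if it made a steal attempt instead.
    enabled:  number of nodes enabled by the executed node (0, 1 or 2).
    """
    if not executed:
        # Thieves do nothing in phase II; failed thieves poll for donations.
        phase3 = "none" if steal_succeeded else "poll_state_flag"
        return "none", phase3
    if enabled == 0:
        return "none", "pop_bottom"
    if enabled == 1:
        return "assign_enabled_node", "none"
    if enabled == 2:
        # The extra node goes back to the own deque only if donation failed.
        phase3 = "none" if donation_succeeded else "push_bottom"
        return "assign_and_spread", phase3
    raise ValueError("a node enables at most two nodes")
```

For instance, `phase_actions(True, 2)` describes a processor that enabled two nodes and whose donation attempt failed, so it pushes the extra node onto its own deque in phase III.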
Definition 4.1. Refer to the set of nodes spread at step $i$ by a processor $p$ as
$\mathit{Spread}^+_i(p)$, and to the set of nodes spread to $p$ as $\mathit{Spread}^-_i(p)$. Moreover,
for some round $t$, define the set of nodes spread during $t$ by $p$ as
\[
\mathit{Spread}^+_t(p) = \bigcup_{i \in \{t[0], \ldots, t[L-1]\}} \mathit{Spread}^+_i(p),
\]
and the set of nodes spread to $p$ as
\[
\mathit{Spread}^-_t(p) = \bigcup_{i \in \{t[0], \ldots, t[L-1]\}} \mathit{Spread}^-_i(p).
\]
The next claim implies that we can use the methodology described in
Section 3 to analyze the performance of WSS.
Claim 4.2. The WSS algorithm can be defined using a scheduling loop and
meets Requirement 2.7.
Proof Sketch. As one can observe from Algorithm 2, like WS, WSS can also
be naturally defined using scheduling loops (lines 2 to 24): 1. at most one
node is executed per scheduling iteration; 2. if a node is migrated to a proces-
sor during an iteration, then it is not migrated again; 3. the length of every
iteration is bounded by a constant; and 4. the full sequence of instructions
executed by any processor can be partitioned into a sequence of scheduling
iterations. As for WS, and for the same reasons, we do not formally show
that the loop starting at line 2 and ending at line 25 of Algorithm 2 satisfies
the requirements of a scheduling loop. Nevertheless, it is easy to deduce, by
observing the definition of the scheduler (given in Algorithm 2), that this
claim holds. 
Taking into account Claim 4.2 and Remark 3.3, we can begin WSS’s
analysis. First, we show that for WSS, a node is migrated iff it is stolen or
spread.
Lemma 4.3. For WSS, at any round $t$ and for any processor $p$,
$M^+_t(p) = \mathit{Stolen}^+_t(p) \cup \mathit{Spread}^+_t(p)$ and
$M^-_t(p) = \mathit{Stolen}^-_t(p) \cup \mathit{Spread}^-_t(p)$.
Proof. Both results follow from Definition 3.4 and Algorithm 2. 
The next lemma is analogous to Lemma 3.9, but concerning the WSS
algorithm.
Lemma 4.4. Consider some $p \in \mathit{Procs}$ and some round $t$ during the execution
of a computation by WSS. Then: 1. if $p \in U_t$ then $p$’s deque is non-empty
and $M^-_t(p) = \emptyset$; and 2. if $p \in S_t$ then $|R_{t+1}(p)| \le 2$.
Proof. This proof follows the same general arguments as the proof of Lemma 3.9.
As for WS, taking into account the definition of the WSS scheduler (depicted
in Algorithm 2), it can be proved by induction on the progression of
the computation’s execution that if a processor has at least one attached
node at the beginning of round $t$, then the processor executes a node during
$t$. From that, and by observing the algorithm, it follows that if $p$ has at
least one node attached, then: 1. $p$ does not make any steal attempt during
$t$, implying $\mathit{Stolen}^-_t(p) = \emptyset$; and 2. $p$’s state flag is set to working at least
until the beginning of the third phase of round $t$, implying that no processor
donates work to $p$, and so $\mathit{Spread}^-_t(p) = \emptyset$. Thus, taking into account
Lemma 4.3, if $p$ has at least one attached node then $M^-_t(p) = \emptyset$.
To conclude the proof of the first statement of this lemma, note that
because $p$ always executes one of its attached nodes as long as there is any,
it follows that if $p \in U_t$ then $p$ has at least two nodes attached and so $p$’s
deque cannot be empty.
Again, since $p$ executes one of its attached nodes as long as there is any,
if $p$ only has a single attached node then $p \in S_t$ and $M^-_t(p) = \emptyset$. By the
nodes’ out-degree assumption, it then follows that at the end of round $t$, $p$
can have at most two attached nodes.
Finally we show that if $R_t(p) = \emptyset$, then at the end of the round $p$ has
at most two attached nodes. If $p$ has no attached node, then its assigned
variable does not contain a valid node, implying that $p$ executes a call to the
LoadBalance procedure (line 22). Since each call only entails one invocation
of the popTop method, then, taking into account the method’s semantics (see
Section B), it follows that $p$ may only get at most one node from its steal
attempt. On the other hand, as one can deduce from the definition of the
LoadBalance procedure, $p$ can only accept at most one node donation during
the procedure’s invocation⁵. Thus, at the end of the call $p$ can only have at
most two attached nodes. To conclude the proof of the second part of the
lemma, note that, after the call to the LoadBalance procedure returns, $p$
takes no further action during the iteration. □
With this, we start obtaining bounds on the expected number of nodes
that are migrated during a round for WSS. To begin, we obtain both lower
and upper bounds on the expected number of nodes stolen for WSS.
Lemma 4.5. For any round $t$ and $p \in U_t$ during a computation’s execution
using WSS, $1 - e^{-\alpha_t} \le E[|\mathit{Stolen}^+_t(p)|] \le \alpha_t$.
Proof. The proof of this result is identical to the proof of Lemma 3.7, follow-
ing from Lemma 4.4 and the definition of WSS (see Algorithm 2). 
Next, we obtain lower bounds on the expected number of nodes spread
by any processor that enables two nodes. A full proof of this result can be
found in the Appendix (Section C.1).
Lemma 4.6. $\forall p \in U_t$, if $|E_t(p)| = 2$ then
\[
E[|\mathit{Spread}^+_t(p)|] \ge \frac{\alpha_t^2}{1 - \alpha_t} \left( 1 - e^{-(1 - \alpha_t)} \right).
\]
The following lemma, together with Lemmas 4.5 and 4.6, allows us to
obtain bounds on the expected number of nodes that are migrated from a
processor. A full proof can be found in the Appendix (Section C.2).
Lemma 4.7. For WSS, at any round $t$, $\mathit{Stolen}^+_t(p) \cap \mathit{Spread}^+_t(p) = \emptyset$.
The next result states an inequality that will be used to prove that WSS
is algorithm short-term stable wrt the interval $[0.7375; 1[$, and its full proof can
be found in the Appendix (Section C.3).
Lemma 4.8. $\forall \alpha \in [0.7375; 1[$,
\[
1 < 1 - e^{-\alpha} + \frac{\alpha^2}{1 - \alpha} \left( 1 - e^{-(1 - \alpha)} \right).
\]
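The inequality of Lemma 4.8 can be sanity-checked numerically. The sketch below evaluates its right-hand side on a grid; it is not a substitute for the proof in Section C.3, and the function names and the grid resolution are arbitrary choices made here.

```python
import math

def migration_lower_bound(alpha: float) -> float:
    """Right-hand side of Lemma 4.8:
    1 - e^{-alpha} + alpha^2 / (1 - alpha) * (1 - e^{-(1 - alpha)})."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    stolen = -math.expm1(-alpha)                    # 1 - e^{-alpha}
    x = 1.0 - alpha
    spread = alpha * alpha * (-math.expm1(-x)) / x  # stable for small x
    return stolen + spread

def holds_on_grid(lo: float = 0.7375, points: int = 1000) -> bool:
    """Check 1 < bound(alpha) on an evenly spaced grid in [lo, 1)."""
    step = (1.0 - lo) / points
    return all(migration_lower_bound(lo + k * step) > 1.0
               for k in range(points))
```

The bound exceeds 1 only narrowly at the left endpoint of the interval and drops below 1 for slightly smaller ratios such as 0.73, which is consistent with the interval's lower limit.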
Finally, we can prove that WSS overcomes the limitations of WS.
⁵ In fact, p only accepts a donation if its steal attempt failed, and so p can only have
at most one attached node.
Theorem 4.9. WSS (as defined in Algorithm 2) is algorithm short-term
stable wrt $[0.7375; 1[$.
Proof. Recall that $L$ denotes the length of each round, and that by definition
$L \ge 1$. Then, from Lemma 4.4, it follows that $\forall p \in S_t$, $|R_{t+1}(p)| \le 2 \le L + 1$.
Furthermore, taking into account Lemma 2.12 and Corollary 2.14, it follows
that to prove this theorem it suffices to show that for any round $t$ such that
$\alpha_t \in [0.7375; 1[$, we have $\forall p \in U_t$, $|E_t(p)| < E[|C_{t+1}(p)| + |M^+_t(p)|]$.
For an arbitrary round $t$, consider any processor $p \in U_t$ (i.e. $p$ is a
processor that is non-self-stable at round $t$). Due to our conventions regarding
computations’ structure, $|E_t(p)|$ is either 0, 1, or 2:
• If $|E_t(p)| = 0$ then, by Lemma 4.5, it follows that
$|E_t(p)| < E[|C_{t+1}(p)| + |M^+_t(p)|]$.
• If $|E_t(p)| = 1$ then, by the specification of the WSS algorithm, it
follows that $|C_{t+1}(p)| = 1$. Taking into account Lemma 4.5, we deduce
$|E_t(p)| < E[|C_{t+1}(p)| + |M^+_t(p)|]$.
• By the specification of the WSS algorithm, it follows that if $|E_t(p)| = 2$
then $|C_{t+1}(p)| = 1$. Thus, to prove this case it suffices to show
$1 < E[|M^+_t(p)|]$. We now prove just that. By Lemmas 4.3 and 4.7, it
follows that $|M^+_t(p)| = |\mathit{Stolen}^+_t(p)| + |\mathit{Spread}^+_t(p)|$, and by Lemmas 4.5
and 4.6, it follows that
\[
E[|M^+_t(p)|] \ge 1 - e^{-\alpha_t} + \frac{\alpha_t^2}{1 - \alpha_t} \left( 1 - e^{-(1 - \alpha_t)} \right).
\]
Thus, taking into account Lemma 4.8, having $\alpha = \alpha_t$, we conclude the proof
of Theorem 4.9.
□
5 Related work
To the best of our knowledge, there is no work that analyzes the performance
of online structured computation schedulers, on a round basis, depending
solely on the ratio of idle processors.
Most theoretical work on online structured computation schedulers has
focused on proving properties related to the (complete) execution of
computations by WS and variants. Blumofe et al. proved that WS is optimal up to
a constant factor in terms of space requirements, expected execution time, and
expected communication costs [BL99]. Arora et al. showed that WS is optimal even
for multiprogrammed environments [ABP98, ABP01]. Agrawal et al. introduced a
variant of WS that avoids unnecessary load balancing cycles in order to achieve
higher efficiency [AHHL07, ALHH08]. The authors proved that WS is capable of
maintaining nearly optimal bounds, while reducing the number of cycles
during which processors are not making progress on a computation’s execution
(corresponding to load balancing cycles), down to a constant factor
away from the computation’s total amount of work. Regarding data locality,
Acar et al. obtained both lower and upper bounds on the number
of cache misses using WS [ABB02]. More recent research has focused
on reducing the synchronization overheads of WS [ACR13], mainly by
eliminating synchronization for local deque operations (i.e. eliminating the
need for synchronization when processors work locally on their own deque).
Even more recently, Muller et al. studied the performance of WS for computations
that include latency operations (such as receiving input from a
user), obtaining promising results [MA16]. On the other hand, most practical
work that deals with the scheduling of structured computations has
focused either on the improvement of current WS implementations (increasing
data locality [ABB02, GZCS10, QW10, SLS16], reducing synchronization
overheads [ACR13, HYUY09, MVS09, MA14, vDvdP14], etc.), or
on the development of libraries and languages implementing WS on both
shared memory environments [ACR16, BJK+96, Fax08, Lei09, MA16] and
distributed settings [CKK+08, DLS+09, LKK12, QW10].
While, for the execution of structured computations, work generation depends
on what has already been executed, for independent task scheduling,
work generation (or, more correctly, task arrival) is assumed to be independent
of what tasks processors already executed [ACMR98, ABKU99,
BFG03, LM93, Mit98, MPS02]. In fact, much of the work in this area consists
of studying the effectiveness of different strategies (that rely on randomness)
for placing n balls (each representing a task) into n bins (each
representing a processor) [ACMR98, ABKU99, LM93, MPS02], where
a strategy’s effectiveness is measured according to the number of balls that
the fullest bin is expected to have: the lower this number is, the more effective
the strategy is. Of course, models of this type, despite being suitable
for modelling independent task schedulers, are far from being apt to model
the performance of structured computation schedulers (for example, note
that in the execution of a structured computation, work is generated per
processor). Within the area of independent task scheduling, perhaps the
work most closely related to ours is on the performance analysis of online
independent task schedulers [BFG03, Mit98]. Yet, to the best of our knowledge,
all the analyses of these schedulers rely upon the assumption
that tasks arrive at the system according to some random distribution (typically
a Poisson distribution). For instance, Mitzenmacher proposed a simple
but powerful scheme to analyze independent task work stealing schedulers,
that uses differential equations [Mit98]. This scheme allows studying
not only the most basic work stealing schedulers (of independent tasks), but
also more complex variants (e.g. allowing processors to repeat a steal whenever
their steal attempt aborted). Nevertheless, the proposed scheme relies
on the assumption that work is generated according to some random distribution,
and so it is not suitable for modelling the behavior of structured
computation schedulers. Berenbrink et al. study the performance of independent
task work stealing schedulers, modelling the system as a Markov
chain, whose states denote the number of tasks attached to each processor
of the system [BFG03]. The authors proved that the work stealing scheduler
for independent tasks, where each steal is allowed to take up to half of a
processor’s work, is stable for a long term execution. Unfortunately, their
analysis also relies on the assumption that tasks arrive at the system according
to a random distribution, and so it is not apt to model the performance
of structured computation schedulers. In addition, the authors assume that
the number of tasks generated at each round is at most the number of processors,
which, taking into account the standard conventions regarding the
structure of computations [ABP98, ABP01, BL99, ACR13], is not realistic
for modelling schedulers of structured computations.
Although it may not be entirely straightforward, it is possible to use
our methodology to model the steal-half work stealing algorithm. To do so,
each steal would have to be divided into a sequence of scheduling iterations,
such that during each iteration the thief transferred a node from its victim.
However, transferring half of a processor’s work may take some time, which
not only implies that the thief will have to wait until it can begin executing
what it stole, but it also means that either concurrent steal attempts to
the same deque are delayed (to avoid duplicate steals), or thieves have to
first transfer the work they intend to steal from their victims and only then
attempt to commit the steal. Regarding the latter option, note that if a thief
is transferring work from one of the only processors that is generating work,
then the steal attempt is likely to fail. Moreover, since during each round
a processor can enable two nodes, then, it would still be possible that the
processor whose deque was being stolen generated a large amount of work.
6 Conclusion
We introduced a formal framework for the performance analysis of structured
computation schedulers, and defined an appropriate criterion for measuring
the performance of online scheduling algorithms: algorithm short-term
stability. Moreover, we introduced a simple and powerful method for
analyzing the performance of these schedulers, and have demonstrated its
convenience by using it to two different ends: 1. proving that the performance
of WS is indeed limited; and 2. analyzing the performance of WSS. Although
WSS is a purely theoretical algorithm, its analysis gave us insight into how to
possibly overcome the limitation of WS. Nevertheless, the greedy spreading
strategy of the algorithm has a severe limitation that makes us question its
practical value: even if every processor is busy, whenever a processor generates
work it makes a spread attempt. This not only makes processors incur
unnecessary overheads (that, for modern computer architectures, are unduly
large) but, even more importantly, entails a serious drawback concerning
the communication costs of the algorithm. Consequently, it is still an open
problem to come up with a practical algorithm that overcomes WS’s limitation
while maintaining its asymptotically optimal expected execution time
and communication costs, and its low space requirements.
References
[ABB02] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The
data locality of work stealing. Theory Comput. Syst., 35(3):321–
347, 2002.
[ABKU99] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal.
Balanced allocations. SIAM J. Comput., 29(1):180–200, 1999.
[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton.
Thread scheduling for multiprogrammed multiprocessors. In
SPAA, pages 119–129, 1998.
[ABP01] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton.
Thread scheduling for multiprogrammed multiprocessors. The-
ory Comput. Syst., 34(2):115–144, 2001.
[ACMR98] Micah Adler, Soumen Chakrabarti, Michael Mitzenmacher, and
Lars Eilstrup Rasmussen. Parallel randomized load balancing.
Random Struct. Algorithms, 13(2):159–188, 1998.
[ACR13] Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Schedul-
ing parallel programs by work stealing with private deques. In
ACM SIGPLAN Symposium on Principles and Practice of Paral-
lel Programming, PPoPP ’13, Shenzhen, China, February 23-27,
2013, pages 219–228, 2013.
[ACR16] Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Pasl:
Parallel algorithm scheduling library, 2016. [Online; accessed
21-January-2016].
[AHHL07] Kunal Agrawal, Yuxiong He, Wen-Jing Hsu, and Charles E. Leis-
erson. Adaptive scheduling with parallelism feedback. In 21th
International Parallel and Distributed Processing Symposium
(IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, Cal-
ifornia, USA, pages 1–7, 2007.
[ALHH08] Kunal Agrawal, Charles E. Leiserson, Yuxiong He, and Wen-Jing
Hsu. Adaptive work-stealing with parallelism feedback. ACM
Trans. Comput. Syst., 26(3), 2008.
[BFG03] Petra Berenbrink, Tom Friedetzky, and Leslie Ann Goldberg.
The natural work-stealing algorithm is stable. SIAM J. Com-
put., 32(5):1260–1279, 2003.
[BJK+96] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul,
Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk:
An efficient multithreaded runtime system. J. Parallel Distrib.
Comput., 37(1):55–69, 1996.
[BL99] Robert D. Blumofe and Charles E. Leiserson. Scheduling multi-
threaded computations by work stealing. J. ACM, 46(5):720–748,
1999.
[CKK+08] Guojing Cong, Sreedhar B. Kodali, Sriram Krishnamoorthy,
Doug Lea, Vijay A. Saraswat, and Tong Wen. Solving large, irreg-
ular graph problems using adaptive work-stealing. In 2008 Inter-
national Conference on Parallel Processing, ICPP 2008, Septem-
ber 8-12, 2008, Portland, Oregon, USA, pages 536–545, 2008.
[DLS+09] James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krish-
namoorthy, and Jarek Nieplocha. Scalable work stealing. In
Proceedings of the ACM/IEEE Conference on High Performance
Computing, SC 2009, November 14-20, 2009, Portland, Oregon,
USA, 2009.
[Fax08] Karl-Filip Faxén. Wool-a work stealing library. SIGARCH Com-
puter Architecture News, 36(5):93–100, 2008.
[GZCS10] Yi Guo, Jisheng Zhao, Vincent Cavé, and Vivek Sarkar. SLAW:
A scalable locality-aware adaptive work-stealing scheduler. In
24th IEEE International Symposium on Parallel and Distributed
Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April
2010 - Conference Proceedings, pages 1–12, 2010.
[HS02] Danny Hendler and Nir Shavit. Non-blocking steal-half work
queues. In Proceedings of the Twenty-First Annual ACM Sympo-
sium on Principles of Distributed Computing, PODC 2002, Mon-
terey, California, USA, July 21-24, 2002, pages 280–289, 2002.
[HYUY09] Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi
Yuasa. Backtracking-based load balancing. In Proceedings of the
14th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, PPOPP 2009, Raleigh, NC, USA, Febru-
ary 14-18, 2009, pages 55–64, 2009.
[Lei09] Charles E. Leiserson. The cilk++ concurrency platform. In Pro-
ceedings of the 46th Design Automation Conference, DAC 2009,
San Francisco, CA, USA, July 26-31, 2009, pages 522–527, 2009.
[LKK12] Jonathan Lifflander, Sriram Krishnamoorthy, and Laxmikant V.
Kalé. Work stealing and persistence-based load balancers for it-
erative overdecomposed applications. In The 21st International
Symposium on High-Performance Parallel and Distributed Com-
puting, HPDC’12, Delft, Netherlands - June 18 - 22, 2012, pages
137–148, 2012.
[LM93] Reinhard Lüling and Burkhard Monien. A dynamic distributed
load balancing algorithm with provable good performance. In
SPAA, pages 164–172, 1993.
[MA14] Adam Morrison and Yehuda Afek. Fence-free work stealing on
bounded TSO processors. In Architectural Support for Program-
ming Languages and Operating Systems, ASPLOS ’14, Salt Lake
City, UT, USA, March 1-5, 2014, pages 413–426, 2014.
[MA16] Stefan K. Muller and Umut A. Acar. Latency-hiding work steal-
ing: Scheduling interacting parallel computations with work
stealing. In Proceedings of the 28th ACM Symposium on Par-
allelism in Algorithms and Architectures, SPAA 2016, Asilomar
State Beach/Pacific Grove, CA, USA, July 11-13, 2016, pages
71–82, 2016.
[Mit98] Michael Mitzenmacher. Analyses of load stealing models based
on differential equations. In SPAA, pages 212–221, 1998.
[MPS02] Michael Mitzenmacher, Balaji Prabhakar, and Devavrat Shah.
Load balancing with memory. In 43rd Symposium on Founda-
tions of Computer Science (FOCS 2002), 16-19 November 2002,
Vancouver, BC, Canada, Proceedings, pages 799–808, 2002.
[MVS09] Maged M. Michael, Martin T. Vechev, and Vijay A. Saraswat.
Idempotent work stealing. In Proceedings of the 14th ACM SIG-
PLAN Symposium on Principles and Practice of Parallel Pro-
gramming, PPOPP 2009, Raleigh, NC, USA, February 14-18,
2009, pages 45–54, 2009.
[QW10] Jean-Noël Quintin and Frédéric Wagner. Hierarchical work-
stealing. In Euro-Par 2010 - Parallel Processing, 16th Interna-
tional Euro-Par Conference, Ischia, Italy, August 31 - September
3, 2010, Proceedings, Part I, pages 217–229, 2010.
[SLS16] Warut Suksompong, Charles E. Leiserson, and Tao B. Schardl.
On the efficiency of localized work stealing. Inf. Process. Lett.,
116(2):100–106, 2016.
[TGT+10] Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis
Roch, and Julien Bernard. A tighter analysis of work stealing.
In Algorithms and Computation - 21st International Symposium,
ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Pro-
ceedings, Part II, pages 291–302, 2010.
[vDvdP14] Tom van Dijk and Jaco C. van de Pol. Lace: Non-blocking split
deque for work-stealing. In Euro-Par 2014: Parallel Process-
ing Workshops - Euro-Par 2014 International Workshops, Porto,
Portugal, August 25-26, 2014, Revised Selected Papers, Part II,
pages 206–217, 2014.
Algorithm 2 The WSS algorithm
1: procedure Scheduler( )
2:   while not finished (computation) do
3:     if ValidNode (self.assigned) then
4:       enabled ← execute (self.assigned)
5:       self.assigned ← none
6:       synch(max_phaseI_length, ι())
7:       if length (enabled) > 0 then
8:         self.assigned ← enabled [0]
9:         if length (enabled) = 2 then
10:          self.handleExtraNode (enabled [1])
11:        else
12:          synch(max_phaseII_length, ι())
13:        end if
14:      else
15:        synch(max_phaseII_length, ι())
16:        self.assigned ← self.deque.popBottom ()
17:        if not ValidNode (self.assigned) then
18:          self.state ← idle
19:        end if
20:      end if
21:    else
22:      self.loadBalance ()
23:    end if
24:    synch(max_phaseIII_length, ι())
25:  end while
26: end procedure
27: procedure handleExtraNode(µ)
28:   self.donation ← µ
29:   donee ← UniformlyRandomProcessor()
30:   result ← CAS (donee.state, idle, self.id)
31:   synch(max_phaseII_length, ι())
32:   if result ≠ success then
33:     self.donation ← none
34:     self.deque.pushBottom (µ)
35:   end if
36: end procedure
37: procedure loadBalance( )
38:   victim ← UniformlyRandomProcessor()
39:   self.assigned ← victim.deque.popTop ()
40:   if ValidNode (self.assigned) then
41:     self.state ← working
42:   end if
43:   synch(max_phaseI_length, ι())
44:   synch(max_phaseII_length, ι())
45:   if self.state ≠ idle and self.state ≠ working then
46:     donor ← processor [self.state]
47:     self.assigned ← donor.donation
48:     self.state ← working
49:   end if
50: end procedure
A Full proofs for the results obtained in Section 2
A.1 Full proof for Lemma 2.5
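Claim A.1. For any step $i$ and processor $p$, $M^-_i(p) \subseteq M^+_i(\mathit{Procs} - \{p\})$.
Proof. By Definition 2.2,
\[
\begin{aligned}
M^-_i(p) &= R_{i+1}(p) \cap (R_i - R_i(p)) \\
&= \Bigl[ R_{i+1} - \Bigl( \bigcup_{q \in \mathit{Procs} - \{p\}} R_{i+1}(q) \Bigr) \Bigr]
   \cap \Bigl( \bigcup_{q \in \mathit{Procs} - \{p\}} R_i(q) \Bigr) \\
&\subseteq R_{i+1} \cap \Bigl( \bigcup_{q \in \mathit{Procs} - \{p\}} R_i(q) \Bigr) \\
&\subseteq \bigcup_{q \in \mathit{Procs} - \{p\}} \Bigl( R_i(q) \cap R_{i+1} \cap \overline{R_{i+1}(q)} \Bigr) \\
&= M^+_i(\mathit{Procs} - \{p\})
\end{aligned}
\]
□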
Proof of Lemma 2.5. Claim A.1 implies that for any step $i \in \{t[0], \ldots, t[L-1]\}$,
we have $M^-_i(p) \subseteq M^+_i(\mathit{Procs} - \{p\})$. Thus, by Definition 2.4 we conclude
this lemma holds. □
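Claim A.1 can be illustrated on a toy instance with plain sets. The sketch below uses the step-level expressions from the proofs in this appendix, $M^-_i(p) = R_{i+1}(p) \cap (R_i - R_i(p))$ and $M^+_i(q) = R_i(q) \cap (R_{i+1} - R_{i+1}(q))$. It assumes that $R_i$ is the union of the per-processor sets $R_i(q)$ and that $M^+_i$ of a set of processors is the union over its members; the names and the example data are hypothetical.

```python
from typing import Dict, Set

Attachment = Dict[str, Set[str]]  # processor id -> ready nodes attached to it

def all_ready(att: Attachment) -> Set[str]:
    """R_i, taken here as the union of the per-processor ready sets."""
    return set().union(*att.values())

def migrated_to(p: str, before: Attachment, after: Attachment) -> Set[str]:
    """M^-_i(p) = R_{i+1}(p) & (R_i - R_i(p))."""
    return after[p] & (all_ready(before) - before[p])

def migrated_from(q: str, before: Attachment, after: Attachment) -> Set[str]:
    """M^+_i(q) = R_i(q) & (R_{i+1} - R_{i+1}(q))."""
    return before[q] & (all_ready(after) - after[q])

def claim_a1_holds(p: str, before: Attachment, after: Attachment) -> bool:
    """Check M^-_i(p) is contained in the union of M^+_i(q) over q != p."""
    others = set().union(*(migrated_from(q, before, after)
                           for q in before if q != p))
    return migrated_to(p, before, after) <= others
```

With `before = {"p": {"a"}, "q": {"b", "c"}}` and `after = {"p": {"a", "b"}, "q": {"c"}}`, node `b` moves from `q` to `p`, so it appears both in the set migrated to `p` and in the set migrated from `q`.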
A.2 Full proof for Lemma 2.6 (Round Progression Lemma)
Claim A.2. For any round $t$ and processor $p \in \mathit{Procs}$, $R_{t+1}(p) \cap M^+_t(p) = \emptyset$.
Proof. For the purpose of contradiction, assume $R_{t+1}(p) \cap M^+_t(p) \ne \emptyset$. Thus,
there is a step $j \in \{t[0], \ldots, t[L-1]\}$ such that $R_{t+1}(p) \cap M^+_j(p) \ne \emptyset$. For
such a step $j$, let $S = R_{t+1}(p) \cap M^+_j(p)$. Then,
\[
\begin{aligned}
S &= R_{(t+1)[0]}(p) \cap M^+_j(p) \\
&= R_{(t+1)[0]}(p) \cap \bigl( R_j(p) \cap (R_{j+1} - R_{j+1}(p)) \bigr) \\
&= R_{(t+1)[0]}(p) \cap R_j(p) \cap R_{j+1} \cap \overline{R_{j+1}(p)}
\end{aligned}
\]
If $j$ were $t[L-1]$, then $S = \emptyset$, and so, as one can deduce, $j < t[L-1]$. Now,
consider a node $\mu \in S$. It follows that
$\mu \in R_{(t+1)[0]}(p) \cap R_j(p) \cap R_{j+1} \cap \overline{R_{j+1}(p)}$.
Since a node that is ready can only become executed, and a node in state
executed does not change its state, it follows that $\forall i \in \{j, \ldots, (t+1)[0]\}$, $\mu \in R_i$.
Moreover, as $\mu \in \overline{R_{j+1}(p)} \cap R_{(t+1)[0]}(p)$ and $j + 1 \le t[L-1]$, it follows that there
is a step $k \in \{j+1, \ldots, t[L-1]\}$ such that $\mu \in R_{k+1}(p) \cap \overline{R_k(p)} \cap R_k$. By
Definition 2.2, it follows that $\mu \in M^-_k(p)$, implying $\mu \in M^-_t(p)$. However, since
$\mu \in M^-_t(p)$ and $\mu \in M^+_t(p)$, it follows that $M^-_t(p) \cap M^+_t(p) \ne \emptyset$, which, together
with Lemma 2.5, contradicts Definition 2.3 (the definition of rounds). □
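Lemma A.3. For any step $i$ and $p \in \mathit{Procs}$, $R_{i+1}(p) \subseteq R_i(p) \cup E_i(p) \cup M^-_i(p)$.
Proof of Lemma A.3. By Definition 2.2, it follows that
\[
\begin{aligned}
& R_i(p) \cup E_i(p) \cup M^-_i(p) \\
&= R_i(p) \cup (R_{i+1}(p) - R_i) \cup \bigl( R_{i+1}(p) \cap (R_i - R_i(p)) \bigr) \\
&= R_i(p) \cup \bigl( R_{i+1}(p) \cap \overline{R_i} \bigr)
   \cup \bigl( R_{i+1}(p) \cap R_i \cap \overline{R_i(p)} \bigr) \\
&= R_i(p) \cup \Bigl[ R_{i+1}(p) \cap \Bigl( \overline{R_i} \cup \bigl( R_i \cap \overline{R_i(p)} \bigr) \Bigr) \Bigr] \\
&= [R_i(p) \cup R_{i+1}(p)] \cap
   \Bigl[ R_i(p) \cup \overline{R_i} \cup \bigl( R_i \cap \overline{R_i(p)} \bigr) \Bigr] \\
&= [R_i(p) \cup R_{i+1}(p)] \cap
   \Bigl[ \bigl( R_i(p) \cup \overline{R_i} \bigr) \cup \bigl( R_i \cap \overline{R_i(p)} \bigr) \Bigr] \\
&= [R_i(p) \cup R_{i+1}(p)] \cap
   \Bigl[ \Bigl( R_i \cup \bigl( R_i(p) \cup \overline{R_i} \bigr) \Bigr)
   \cap \Bigl( \bigl( R_i(p) \cup \overline{R_i} \bigr) \cup \overline{R_i(p)} \Bigr) \Bigr] \\
&= R_i(p) \cup R_{i+1}(p)
\end{aligned}
\]
Since $R_{i+1}(p) \subseteq R_i(p) \cup R_{i+1}(p)$, the lemma follows. □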
Lemma A.4. For any steps $s_0, s_1$, with $s_1 > s_0$, and processor $p \in \mathit{Procs}$,
\[
\bigcup_{i \in \{s_0,\ldots,s_1\}} R_i(p) \subseteq R_{s_0}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_1-1\}} \left(E_i(p) \cup M^-_i(p)\right).
\]
Proof of Lemma A.4. We prove this lemma by induction.
Base case For the base case, let $s_1 = s_0 + 1$. Then the statement of the lemma holds
iff $R_{s_0}(p) \cup R_{s_1}(p) \subseteq R_{s_0}(p) \cup E_{s_0}(p) \cup M^-_{s_0}(p)$. Taking into account
Lemma A.3, we conclude the base case holds.
Induction step Assume that the result holds for some $s_l > s_0$, and show
that it also holds for $s_l + 1$. The induction hypothesis is
\[
\bigcup_{i \in \{s_0,\ldots,s_l\}} R_i(p) \subseteq R_{s_0}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} \left(E_i(p) \cup M^-_i(p)\right)
\]
and we must prove
\[
\bigcup_{i \in \{s_0,\ldots,s_l+1\}} R_i(p) \subseteq R_{s_0}(p) \cup \bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \left(E_i(p) \cup M^-_i(p)\right).
\]
The right-hand side satisfies
\begin{align*}
R_{s_0}(p) \cup \bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \left(E_i(p) \cup M^-_i(p)\right)
&= R_{s_0}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} \left(E_i(p) \cup M^-_i(p)\right) \cup E_{s_l}(p) \cup M^-_{s_l}(p) \\
&\supseteq \bigcup_{i \in \{s_0,\ldots,s_l\}} R_i(p) \cup E_{s_l}(p) \cup M^-_{s_l}(p).
\end{align*}
Again, taking into account Lemma A.3, which gives $R_{s_l+1}(p) \subseteq R_{s_l}(p) \cup E_{s_l}(p) \cup M^-_{s_l}(p)$,
it is easy to deduce that the induction step holds, implying the lemma holds.
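Because Lemma A.4 follows from Lemma A.3 by pure set algebra, the multi-step inclusion can also be exercised on arbitrary sequences of sets. The sketch below is an illustration under stated assumptions, not the paper's method: the sequences are random and do not model the node state machine of Definition 2.1, and the name `check_lemma_a4` and its parameters are hypothetical.

```python
import random


def check_lemma_a4(trials=500, steps=6, universe=10, seed=1):
    """Check the inclusion of Lemma A.4 with s0 = 0 and s1 = steps."""
    rng = random.Random(seed)
    for _ in range(trials):
        # R[i] plays the role of R_i, Rp[i] the role of R_i(p) (a subset of R_i).
        R = [{v for v in range(universe) if rng.random() < 0.5}
             for _ in range(steps + 1)]
        Rp = [{v for v in R[i] if rng.random() < 0.5}
              for i in range(steps + 1)]
        s0, s1 = 0, steps
        lhs = set().union(*(Rp[i] for i in range(s0, s1 + 1)))
        rhs = set(Rp[s0])
        for i in range(s0, s1):
            rhs |= Rp[i + 1] - R[i]              # E_i(p)
            rhs |= Rp[i + 1] & (R[i] - Rp[i])    # M^-_i(p)
        if not lhs <= rhs:
            return False
    return True
```

Each loop iteration adds exactly the two sets that Lemma A.3 needs to cover $R_{i+1}(p)$, mirroring the induction step.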

Lemma A.5. For any round $t$ and processor $p \in \mathit{Procs}$,
\[
R_{t+1}(p) \subseteq \left(E_t(p) \cup R_t(p) \cup M^-_t(p)\right) - \left(C_t(p) \cup M^+_t(p)\right).
\]
Proof of Lemma A.5. By Definitions 2.1, 2.2 and 2.4, it follows that $R_{t+1}(p) \cap C_t(p) = \emptyset$.
Since by Claim A.2, $R_{t+1}(p) \cap M^+_t(p) = \emptyset$, it follows that
$R_{t+1}(p) \cap \left(C_t(p) \cup M^+_t(p)\right) = \emptyset$. Thus, it suffices to show that
$R_{t+1}(p) \subseteq E_t(p) \cup R_t(p) \cup M^-_t(p)$. To conclude this proof, note that Lemma A.4, with $s_0 = t[0]$
and $s_1 = (t+1)[0]$, implies just that. □
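Claim A.6. For any round $t$ and processor $p \in \mathit{Procs}$, $R_{t+1}(p) \supseteq M^-_t(p) - \left(C_t(p) \cup M^+_t(p)\right)$.
Proof. First, for an arbitrary step $s_0$ prove by induction that for any step $s_1$
such that $s_1 > s_0$,
\[
R_{s_1}(p) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} M^-_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big).
\]
Base case For the base case, let $s_1 = s_0 + 1$. Then the statement holds
iff $R_{s_1}(p) \supseteq M^-_{s_0}(p) - \left(C_{s_0}(p) \cup M^+_{s_0}(p)\right)$. To conclude the proof of
the base case, note that
\begin{align*}
M^-_{s_0}(p) - \left(C_{s_0}(p) \cup M^+_{s_0}(p)\right)
&\subseteq M^-_{s_0}(p) \\
&= R_{s_1}(p) \cap \left(R_{s_0} - R_{s_0}(p)\right) \\
&\subseteq R_{s_1}(p)
\end{align*}
Induction step Assume that the result holds for some $s_l > s_0$, and show
that it also holds for $s_l + 1$. The induction hypothesis is
\[
R_{s_l}(p) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big)
\]
and we prove
\[
R_{s_l+1}(p) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^-_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big).
\]
The right-hand side satisfies
\begin{align*}
&\Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^-_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) \\
&= \Big(M^-_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) \\
&= \Big(M^-_{s_l}(p) - \bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p) - \bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) \\
&\subseteq M^-_{s_l}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p) - \bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) \\
&= M^-_{s_l}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p) - \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} \big(C_i(p) \cup M^+_i(p)\big) \cup C_{s_l}(p) \cup M^+_{s_l}(p)\Big)\Big) \\
&= M^-_{s_l}(p) \cup \Big(\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^-_i(p) - \bigcup_{i \in \{s_0,\ldots,s_l-1\}} \big(C_i(p) \cup M^+_i(p)\big)\Big) - \big(C_{s_l}(p) \cup M^+_{s_l}(p)\big)\Big) \\
&\subseteq M^-_{s_l}(p) \cup \Big(R_{s_l}(p) - \big(C_{s_l}(p) \cup M^+_{s_l}(p)\big)\Big) \\
&= \Big(R_{s_l+1}(p) \cap \big(R_{s_l} - R_{s_l}(p)\big)\Big) \cup \Big(R_{s_l}(p) - \big(C_{s_l}(p) \cup M^+_{s_l}(p)\big)\Big) \\
&\subseteq R_{s_l+1}(p) \cup \Big(R_{s_l}(p) - \big(C_{s_l}(p) \cup M^+_{s_l}(p)\big)\Big) \\
&= R_{s_l+1}(p) \cup \Big(R_{s_l}(p) \cap \overline{C_{s_l}(p)} \cap \overline{M^+_{s_l}(p)}\Big) \\
&= R_{s_l+1}(p) \cup \Big(R_{s_l}(p) \cap \overline{\big(R_{s_l}(p) - R_{s_l+1}\big)} \cap \overline{\big(R_{s_l}(p) \cap (R_{s_l+1} - R_{s_l+1}(p))\big)}\Big) \\
&= R_{s_l+1}(p) \cup \Big(R_{s_l}(p) \cap \big(\overline{R_{s_l}(p)} \cup R_{s_l+1}\big) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big) \\
&= R_{s_l+1}(p) \cup \Big[\Big(\big(R_{s_l}(p) \cap \overline{R_{s_l}(p)}\big) \cup \big(R_{s_l}(p) \cap R_{s_l+1}\big)\Big) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big] \\
&= R_{s_l+1}(p) \cup \Big[R_{s_l}(p) \cap R_{s_l+1} \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big] \\
&= R_{s_l+1}(p) \cup \Big[R_{s_l+1} \cap \Big(\big(R_{s_l}(p) \cap \overline{R_{s_l}(p)}\big) \cup \big(R_{s_l}(p) \cap \overline{R_{s_l+1}}\big) \cup \big(R_{s_l}(p) \cap R_{s_l+1}(p)\big)\Big)\Big] \\
&\subseteq R_{s_l+1}(p) \cup \Big[R_{s_l+1} \cap \Big(\big(R_{s_l}(p) \cap \overline{R_{s_l+1}}\big) \cup R_{s_l+1}(p)\Big)\Big] \\
&\subseteq R_{s_l+1}(p) \cup \Big[\big(R_{s_l+1} \cap R_{s_l}(p) \cap \overline{R_{s_l+1}}\big) \cup \big(R_{s_l+1} \cap R_{s_l+1}(p)\big)\Big] \\
&\subseteq R_{s_l+1}(p) \cup \big(R_{s_l+1} \cap R_{s_l+1}(p)\big) \\
&= R_{s_l+1}(p)
\end{align*}
To conclude this proof, let $s_0 = t[0]$ and $s_1 = (t+1)[0]$. □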
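Claim A.7. For any steps $s_0, s_1$ such that $s_1 > s_0$:
\[
R_{s_1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} C_i(p)\Big) \supseteq \bigcap_{i \in \{s_0,\ldots,s_1-1\}} \Big[R_{s_0}(p) \cap \big(\overline{R_i(p)} \cup \overline{R_{i+1}} \cup R_{i+1}(p)\big)\Big]
\]
Proof. Prove this claim for an arbitrary $s_0$ by induction on $s_1$.
Base case For the base case, consider $s_1 = s_0 + 1$. Then the statement holds iff
\[
R_{s_1}(p) \cup C_{s_0}(p) \supseteq R_{s_0}(p) \cap \big(\overline{R_{s_0}(p)} \cup \overline{R_{s_0+1}} \cup R_{s_0+1}(p)\big)
\]
To conclude the proof of the base case, note that
\begin{align*}
R_{s_0}(p) \cap \big(\overline{R_{s_0}(p)} \cup \overline{R_{s_0+1}} \cup R_{s_0+1}(p)\big)
&= \big(R_{s_0}(p) \cap \overline{R_{s_0}(p)}\big) \cup \big(R_{s_0}(p) \cap \overline{R_{s_1}}\big) \cup \big(R_{s_0}(p) \cap R_{s_1}(p)\big) \\
&\subseteq C_{s_0}(p) \cup R_{s_1}(p)
\end{align*}
Induction step Assuming the claim holds for $s_l \geq s_0 + 1$, prove the claim
also holds for $s_l + 1$. Since by the induction hypothesis
\[
R_{s_l}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \supseteq \bigcap_{i \in \{s_0,\ldots,s_l-1\}} \Big[R_{s_0}(p) \cap \big(\overline{R_i(p)} \cup \overline{R_{i+1}} \cup R_{i+1}(p)\big)\Big]
\]
it suffices to show that
\[
R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} C_i(p)\Big) \supseteq \Big(R_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \Big[R_{s_0}(p) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big]
\]
To conclude,
\begin{align*}
&\Big(R_{s_l}(p) \cup \Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big]\Big) \cap \Big[R_{s_0}(p) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big] \\
&\subseteq \Big(R_{s_l}(p) \cup \Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big]\Big) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big) \\
&= \Big(\Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big) \cup \Big(\big(R_{s_l}(p) \cap \overline{R_{s_l}(p)}\big) \cup \big(R_{s_l}(p) \cap \overline{R_{s_l+1}}\big) \cup \big(R_{s_l}(p) \cap R_{s_l+1}(p)\big)\Big) \\
&\subseteq \Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \cup C_{s_l}(p) \cup R_{s_l+1}(p) \\
&\subseteq R_{s_l+1}(p) \cup \Big[\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} C_i(p)\Big]
\end{align*}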

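Claim A.8. For any steps $s_0, s_1$ such that $s_1 > s_0$,
\[
R_{s_1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} C_i(p)\Big) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} E_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,s_1-1\}} M^+_i(p)\Big)
\]
Proof. Prove this claim for an arbitrary $s_0$ by induction on $s_1$.
Base case For the base case, consider $s_1 = s_0 + 1$. Then the statement holds iff
\[
R_{s_1}(p) \cup C_{s_0}(p) \supseteq E_{s_0}(p) - M^+_{s_0}(p)
\]
To conclude the proof of the base case, note that by Definition 2.2
\begin{align*}
E_{s_0}(p) - M^+_{s_0}(p)
&= \big(R_{s_0+1}(p) - R_{s_0}\big) - \big(R_{s_0}(p) \cap (R_{s_0+1} - R_{s_0+1}(p))\big) \\
&\subseteq R_{s_1}(p)
\end{align*}
Induction step Assuming the claim is true for $s_l \geq s_0 + 1$, show that it
holds for $s_l + 1$. Thus, using the induction hypothesis
\[
R_{s_l}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} E_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big)
\]
show that
\[
R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} C_i(p)\Big) \supseteq \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^+_i(p)\Big)
\]
It follows
\begin{align*}
&\Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) - \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^+_i(p)\Big) \\
&= \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) \cap \overline{\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^+_i(p)} \\
&= \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) \cap \Big(\bigcap_{i \in \{s_0,\ldots,(s_l+1)-1\}} \overline{M^+_i(p)}\Big) \\
&= \overline{M^+_{s_l}(p)} \cap \Big[\Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) \cap \Big(\bigcap_{i \in \{s_0,\ldots,s_l-1\}} \overline{M^+_i(p)}\Big)\Big] \\
&= \overline{M^+_{s_l}(p)} \cap \Big[\Big(E_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} E_i(p)\Big) \cap \Big(\bigcap_{i \in \{s_0,\ldots,s_l-1\}} \overline{M^+_i(p)}\Big)\Big] \\
&= \overline{M^+_{s_l}(p)} \cap \Big[\Big(E_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} E_i(p)\Big) \cap \overline{\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)}\Big] \\
&= \overline{M^+_{s_l}(p)} \cap \Big[\Big(E_{s_l}(p) \cap \overline{\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)}\Big) \cup \Big(\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} E_i(p)\Big) \cap \overline{\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)}\Big)\Big] \\
&\subseteq \overline{M^+_{s_l}(p)} \cap \Big[\Big(E_{s_l}(p) \cap \overline{\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)}\Big) \cup \Big(R_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big)\Big] \\
&\subseteq \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big) \cap \Big[\Big(\big(R_{s_l+1}(p) - R_{s_l}\big) - \bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cup \Big(R_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big)\Big] \\
&\subseteq \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big) \cap \Big[R_{s_l+1}(p) \cup R_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \\
&= \Big[R_{s_l}(p) \cup R_{s_l+1}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big) \\
&= \Big[\Big(R_{s_l+1}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big] \cup \Big[R_{s_l}(p) \cap \big(\overline{R_{s_l}(p)} \cup \overline{R_{s_l+1}} \cup R_{s_l+1}(p)\big)\Big] \\
&\subseteq R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cup \Big[\big(R_{s_l}(p) \cap \overline{R_{s_l}(p)}\big) \cup \big(R_{s_l}(p) \cap \overline{R_{s_l+1}}\big) \cup \big(R_{s_l}(p) \cap R_{s_l+1}(p)\big)\Big] \\
&\subseteq R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cup \big(R_{s_l}(p) - R_{s_l+1}\big) \cup R_{s_l+1}(p) \\
&= R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cup C_{s_l}(p) \\
&= R_{s_l+1}(p) \cup \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} C_i(p)\Big)
\end{align*}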

Lemma A.9. For any round $t$ and processor $p \in \mathit{Procs}$,
\[
R_{t+1}(p) \supseteq \left(E_t(p) \cup R_t(p) \cup M^-_t(p)\right) - \left(C_t(p) \cup M^+_t(p)\right)
\]
Proof of Lemma A.9. To prove this direction of the lemma, it suffices to
show:
1. $R_{t+1}(p) \supseteq M^-_t(p) - \left(C_t(p) \cup M^+_t(p)\right)$
2. $R_{t+1}(p) \supseteq R_t(p) - \left(C_t(p) \cup M^+_t(p)\right)$
3. $R_{t+1}(p) \supseteq E_t(p) - \left(C_t(p) \cup M^+_t(p)\right)$
Prove each of these propositions:
1. Claim A.6 implies Proposition 1 holds.
2. To prove Proposition 2, note that
\begin{align*}
R_t(p) - \left(C_t(p) \cup M^+_t(p)\right)
&= \left(R_t(p) - M^+_t(p)\right) - C_t(p) \\
&= \Big(R_{t[0]}(p) - \bigcup_{i \in \{t[0],\ldots,t[L-1]\}} M^+_i(p)\Big) - C_t(p) \\
&= \Big(R_{t[0]}(p) - \bigcup_{i \in \{t[0],\ldots,t[L-1]\}} \big(R_i(p) \cap (R_{i+1} - R_{i+1}(p))\big)\Big) - C_t(p) \\
&= \Big(R_{t[0]}(p) \cap \bigcap_{i \in \{t[0],\ldots,t[L-1]\}} \overline{R_i(p) \cap (R_{i+1} - R_{i+1}(p))}\Big) - C_t(p) \\
&= \Big(R_{t[0]}(p) \cap \bigcap_{i \in \{t[0],\ldots,t[L-1]\}} \big(\overline{R_i(p)} \cup \overline{R_{i+1}} \cup R_{i+1}(p)\big)\Big) - C_t(p) \\
&= \Big(\bigcap_{i \in \{t[0],\ldots,t[L-1]\}} \Big[R_{t[0]}(p) \cap \big(\overline{R_i(p)} \cup \overline{R_{i+1}} \cup R_{i+1}(p)\big)\Big]\Big) - C_t(p)
\end{align*}
By Claim A.7, letting $s_0 = t[0]$ and $s_1 = (t+1)[0]$, it follows
\[
R_{t+1}(p) \supseteq \left(R_{t+1}(p) \cup C_t(p)\right) - C_t(p) \supseteq \left(R_t(p) - M^+_t(p)\right) - C_t(p).
\]
3. By Claim A.8, letting $s_0 = t[0]$ and $s_1 = (t+1)[0]$, it follows
\[
R_{t+1}(p) \cup C_t(p) \supseteq E_t(p) - M^+_t(p)
\]
To conclude this proof note that
\[
R_{t+1}(p) \supseteq \left(R_{t+1}(p) \cup C_t(p)\right) - C_t(p) \supseteq \left(E_t(p) - M^+_t(p)\right) - C_t(p)
\]

Proof of Lemma 2.6. Lemmas A.5 and A.9 imply this result. 
A.3 Full proof for Lemma 2.13 (Connecting Lemma)
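Claim A.10. For any round $t$ and $p \in \mathit{Procs}$, $E_t(p) \cap R_t(p) = \emptyset$.
Proof. Given an arbitrary step $s_0$, we prove by induction on a step $s_1$ (where
$s_1 > s_0$) that $\Big[\bigcup_{i \in \{s_0,\ldots,s_1-1\}} E_i(p)\Big] \cap R_{s_0}(p) = \emptyset$.
Base case Let $s_1 = s_0 + 1$. Then $\Big[\bigcup_{i \in \{s_0,\ldots,s_1-1\}} E_i(p)\Big] \cap R_{s_0}(p) = \emptyset$ iff
$E_{s_0}(p) \cap R_{s_0}(p) = \emptyset$. By Definition 2.2, it follows
\begin{align*}
E_{s_0}(p) \cap R_{s_0}(p)
&= \big(R_{s_0+1}(p) - R_{s_0}\big) \cap R_{s_0}(p) \\
&= R_{s_1}(p) \cap \overline{R_{s_0}} \cap R_{s_0}(p) \\
&\subseteq \overline{R_{s_0}} \cap R_{s_0} \\
&= \emptyset.
\end{align*}
Induction step To prove the induction step, assume the claim holds for
a step $s_l > s_0$ and then prove that it also holds for $s_l + 1$.
\begin{align*}
\Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} E_i(p)\Big) \cap R_{s_0}(p)
&= \Big(\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} E_i(p)\Big) \cap R_{s_0}(p)\Big) \cup \big(E_{s_l}(p) \cap R_{s_0}(p)\big) \\
&= E_{s_l}(p) \cap R_{s_0}(p)
\end{align*}
Since $s_l > s_0$, by Definition 2.1 it follows that $E_{s_l} \cap R_{s_0} = \emptyset$, implying
$E_{s_l}(p) \cap R_{s_0}(p) = \emptyset$.
To conclude the proof, let $s_0 = t[0]$ and $s_1 = (t+1)[0]$. □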
Claim A.11. For any round $t$ and $p \in \mathit{Procs}$, $\left(R_t(p) \cup E_t(p)\right) \cap M^-_t(p) = \emptyset$.
Proof. For the purpose of contradiction, let us assume $\left(R_t(p) \cup E_t(p)\right) \cap M^-_t(p) \neq \emptyset$.
Then, there must be a step $j \in \{t[0], \ldots, t[L-1]\}$ such
that $\left(R_t(p) \cup E_t(p)\right) \cap M^-_j(p) \neq \emptyset$. Thus, at least one of the following
propositions has to hold:
1. $R_t(p) \cap M^-_j(p) \neq \emptyset$;
2. $E_t(p) \cap M^-_j(p) \neq \emptyset$.
To conclude the proof of this claim, we prove that none of the propositions
holds, contradicting our hypothesis that $\left(R_t(p) \cup E_t(p)\right) \cap M^-_t(p) \neq \emptyset$.
Contradiction for Proposition 1 Let $S = R_t(p) \cap M^-_j(p)$. Then,
\begin{align*}
S &= R_t(p) \cap M^-_j(p) \\
&= R_{t[0]}(p) \cap \big(R_{j+1}(p) \cap (R_j - R_j(p))\big) \\
&= R_{t[0]}(p) \cap R_{j+1}(p) \cap R_j \cap \overline{R_j(p)}
\end{align*}
If $j$ were $t[0]$, then $S = \emptyset$, and so, $j > t[0]$. Consider any node
$\mu \in S$. Since a node that is ready can only become executed,
and a node in state executed does not change its state, it follows that
$\forall i \in \{t[0], \ldots, j+1\},\ \mu \in R_i$. Noting that $\mu \in R_{t[0]}(p) \cap \overline{R_j(p)} \cap R_j$,
there must be a step $k \in \{t[0], \ldots, j-1\}$ such that
$\mu \in R_k(p) \cap \overline{R_{k+1}(p)}$. Because $\forall i \in \{t[0], \ldots, j+1\},\ \mu \in R_i$, it follows that
$\mu \in R_k(p) \cap \overline{R_{k+1}(p)} \cap R_{k+1}$. By Definition 2.2, it follows that $\mu \in M^+_k(p)$,
implying $\mu \in M^+_t(p)$. However, since $\mu \in M^-_t(p)$ and $\mu \in M^+_t(p)$, it
follows that $M^-_t(p) \cap M^+_t(p) \neq \emptyset$, which, together with Lemma 2.5, contradicts
Definition 2.3 (the definition of a round).
Contradiction for Proposition 2 If $E_t(p) \cap M^-_j(p) \neq \emptyset$, then there must
be a step $m \in \{t[0], \ldots, t[L-1]\}$ such that $E_m(p) \cap M^-_j(p) \neq \emptyset$. Let
$S = E_m(p) \cap M^-_j(p)$. It follows
\begin{align*}
S &= E_m(p) \cap M^-_j(p) \\
&= \big(R_{m+1}(p) - R_m\big) \cap \big(R_{j+1}(p) \cap (R_j - R_j(p))\big) \\
&= R_{m+1}(p) \cap \overline{R_m} \cap R_{j+1}(p) \cap R_j \cap \overline{R_j(p)}
\end{align*}
Consider any node $\mu \in S$.
If a node is not in state ready at step $m$, then it is either in state not
ready or executed. Thus, at step $m$, $\mu$ is either in state not ready
or executed. Because at step $m+1$ $\mu$ is in state ready, and since a
node that is in state executed does not change its state, we deduce
that $\mu$ is in state not ready at step $m$. Definition 2.1 then implies
that until step $m$ (including $m$), $\mu$ has been in state not ready.
By Definition 2.2, a node can only be migrated at some step $i$ if it is
ready at step $i$. Since $\mu$ is migrated at step $j$, then it must be ready
at that step, implying $m < j$. Furthermore, if $m = j-1$, then $S = \emptyset$,
and so, it follows that $m \in \{t[0], \ldots, j-2\}$.
Since a node that is ready can only become executed, and a node
in state executed does not change its state, $\mu \in S$ implies
$\forall i \in \{m+1, \ldots, j+1\},\ \mu \in R_i$. Moreover, as $\mu \in R_{m+1}(p) \cap \overline{R_j(p)}$ and
$m < j-1$, it follows that there is a step $k \in \{m+1, \ldots, j\}$ such that
$\mu \in R_k(p) \cap \overline{R_{k+1}(p)}$. Because $\forall i \in \{m+1, \ldots, j+1\},\ \mu \in R_i$, it
follows that $\mu \in R_k(p) \cap \overline{R_{k+1}(p)} \cap R_{k+1}$.
By Definition 2.2, it follows that $\mu \in M^+_k(p)$, implying $\mu \in M^+_t(p)$. However,
since $\mu \in M^-_t(p)$ and $\mu \in M^+_t(p)$, it follows that $M^-_t(p) \cap M^+_t(p) \neq \emptyset$,
which, together with Lemma 2.5, contradicts Definition 2.3 (the definition of rounds).

Claim A.12. For any round $t$ and $p \in \mathit{Procs}$, $C_t(p) \cap M^+_t(p) = \emptyset$.
Proof. Let $s_0 = t[0]$. We prove by induction on a step $s_1 \in \{t[0]+1, \ldots, t[L-1]+1\}$
that $\Big[\bigcup_{i \in \{s_0,\ldots,s_1-1\}} C_i(p)\Big] \cap \Big[\bigcup_{i \in \{s_0,\ldots,s_1-1\}} M^+_i(p)\Big] = \emptyset$.
Base case Let $s_1 = s_0 + 1$. Then the statement holds iff $C_{s_0}(p) \cap M^+_{s_0}(p) = \emptyset$.
By Definition 2.2, it follows
\begin{align*}
C_{s_0}(p) \cap M^+_{s_0}(p)
&= \big(R_{s_0}(p) - R_{s_1}\big) \cap \big(R_{s_0}(p) \cap (R_{s_1} - R_{s_1}(p))\big) \\
&= R_{s_0}(p) \cap \overline{R_{s_1}} \cap R_{s_0}(p) \cap R_{s_1} \cap \overline{R_{s_1}(p)} \\
&= \emptyset.
\end{align*}
Induction step To prove the induction step, assume the claim holds for
a step $s_l > s_0$ and then prove that it also holds for $s_l + 1$, where
$(s_l + 1) \in \{t[0]+1, \ldots, t[L-1]+1\}$. The induction hypothesis is
\[
\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) = \emptyset.
\]
Thus,
\begin{align*}
&\Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} C_i(p)\Big) \cap \Big(\bigcup_{i \in \{s_0,\ldots,(s_l+1)-1\}} M^+_i(p)\Big) \\
&= \Big(C_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \Big(M^+_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \\
&= \Big[C_{s_l}(p) \cap \Big(M^+_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big)\Big] \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \Big(M^+_{s_l}(p) \cup \bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big)\Big] \\
&= \big(M^+_{s_l}(p) \cap C_{s_l}(p)\big) \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cap C_{s_l}(p)\Big] \cup \Big[M^+_{s_l}(p) \cap \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big) \cap \Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big)\Big] \\
&= \big(M^+_{s_l}(p) \cap C_{s_l}(p)\big) \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cap C_{s_l}(p)\Big] \cup \Big[M^+_{s_l}(p) \cap \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \\
&= \Big(\big(R_{s_l}(p) - R_{s_l+1}\big) \cap \big(R_{s_l}(p) \cap (R_{s_l+1} - R_{s_l+1}(p))\big)\Big) \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cap C_{s_l}(p)\Big] \cup \Big[M^+_{s_l}(p) \cap \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \\
&= \Big(R_{s_l}(p) \cap \overline{R_{s_l+1}} \cap R_{s_l}(p) \cap R_{s_l+1} \cap \overline{R_{s_l+1}(p)}\Big) \cup \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cap C_{s_l}(p)\Big] \cup \Big[M^+_{s_l}(p) \cap \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \\
&= \Big[\Big(\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big) \cap C_{s_l}(p)\Big] \cup \Big[M^+_{s_l}(p) \cap \bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big]
\end{align*}
By Definition 2.2, for any step $i$, $C_i(p)$ is composed of nodes that are
attached to $p$ at step $i$ (implying they are ready at step $i$), but are
no longer in state ready at step $i+1$. Thus, by Definition 2.1, since
a node that is in state ready can only change its state to executed,
and since a node that is executed cannot become not ready or
ready, it follows that $\Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \subseteq \mathit{Executed}_{s_l}$ (the set of nodes
in state executed at step $s_l$). On the other hand, by Definition 2.2,
a node can only be migrated from $p$ at step $s_l$ if it is ready at that
step, implying $M^+_{s_l}(\mathit{Procs}) \subseteq R_{s_l}$. With this, because a node can only
be in one state at each step, it follows that $\mathit{Executed}_{s_l} \cap R_{s_l} = \emptyset$, implying
$\Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} C_i(p)\Big] \cap M^+_{s_l}(p) = \emptyset$. As such, to conclude this proof
it only remains to show that $\Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big] \cap C_{s_l}(p) = \emptyset$.
Let $S = \Big[\bigcup_{i \in \{s_0,\ldots,s_l-1\}} M^+_i(p)\Big] \cap C_{s_l}(p)$. For the purpose of contradiction,
let us assume $S \neq \emptyset$. Thus, there is a step $j \in \{s_0, \ldots, s_l-1\}$
such that $M^+_j(p) \cap C_{s_l}(p) \neq \emptyset$. Let $\mu$ be some node such that
$\mu \in M^+_j(p) \cap C_{s_l}(p)$. By Definition 2.2, it follows
\begin{align*}
M^+_j(p) \cap C_{s_l}(p)
&= \big(R_j(p) \cap (R_{j+1} - R_{j+1}(p))\big) \cap \big(R_{s_l}(p) - R_{s_l+1}\big) \\
&= R_j(p) \cap R_{j+1} \cap \overline{R_{j+1}(p)} \cap R_{s_l}(p) \cap \overline{R_{s_l+1}}
\end{align*}
which implies $\mu \in R_j(p) \cap R_{j+1} \cap \overline{R_{j+1}(p)} \cap R_{s_l}(p) \cap \overline{R_{s_l+1}}$. If $j$ were
$s_l - 1$, it would follow that $R_j(p) \cap R_{j+1} \cap \overline{R_{j+1}(p)} \cap R_{s_l}(p) \cap \overline{R_{s_l+1}} = \emptyset$,
and so $j < s_l - 1$. Since a node that is ready can only become
executed, and a node in state executed does not change its state,
then $\forall i \in \{j, \ldots, s_l-1\},\ \mu \in R_i$. Moreover, as $\mu \in \overline{R_{j+1}(p)} \cap R_{s_l}(p)$
and $s_l > j+1$, it follows that there is a step $k \in \{j+1, \ldots, s_l-1\}$
such that $\mu \in R_{k+1}(p) \cap \overline{R_k(p)} \cap R_k$. By Definition 2.2, it follows that
$\mu \in M^-_k(p)$, implying $\mu \in M^-_t(p)$. However, since $\mu \in M^-_t(p)$ and
$\mu \in M^+_t(p)$, it follows that $M^-_t(p) \cap M^+_t(p) \neq \emptyset$, which, together with
Lemma 2.5, contradicts Definition 2.3 (specifically, that no node is
migrated more than once during the same round).

Claim A.13. For any round $t$ and $p \in \mathit{Procs}$, $R_t(p) \cup E_t(p) \cup M^-_t(p) \supseteq C_t(p) \cup M^+_t(p)$.
Proof. By Requirement 2.7, it follows that $R_t(p) \cup E_t(p) \cup M^-_t(p) \supseteq C_t(p)$. Thus,
it suffices to show that $R_t(p) \cup E_t(p) \cup M^-_t(p) \supseteq M^+_t(p)$. By Definition 2.2,
for any step $i$, $M^+_i(p) = R_i(p) \cap \left(R_{i+1} - R_{i+1}(p)\right)$, implying $R_i(p) \supseteq M^+_i(p)$.
To conclude this proof, let $s_0 = t[0]$ and $s_1 = t[L-1]$ in Lemma A.4. □
Proof of Lemma 2.13. First, note that $|E_t(p)| < |C_{t+1}(p)| + |M^+_t(p)|$ iff
\begin{align*}
|E_t(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)| - |M^+_t(p)|
&< |C_{t+1}(p)| + |M^+_t(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)| - |M^+_t(p)| \\
&= |C_{t+1}(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)|
\end{align*}
Noting that:
1. Claim A.10 implies
$|E_t(p)| + |R_t(p)| = |E_t(p) \cup R_t(p)|$;
2. Claim A.11 implies
$|R_t(p) \cup E_t(p)| + |M^-_t(p)| = |R_t(p) \cup E_t(p) \cup M^-_t(p)|$;
3. Claim A.12 implies
$|C_t(p)| + |M^+_t(p)| = |C_t(p) \cup M^+_t(p)|$;
4. Claim A.13 implies
$|E_t(p) \cup R_t(p) \cup M^-_t(p)| - |C_t(p) \cup M^+_t(p)| = \left|\left(E_t(p) \cup R_t(p) \cup M^-_t(p)\right) - \left(C_t(p) \cup M^+_t(p)\right)\right|$; and
5. Lemma 2.6 implies
$|R_{t+1}(p)| = \left|\left(E_t(p) \cup R_t(p) \cup M^-_t(p)\right) - \left(C_t(p) \cup M^+_t(p)\right)\right|$,
it follows that $|R_{t+1}(p)| = |E_t(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)| - |M^+_t(p)|$. To
conclude this proof, note that Requirement 2.7 implies $|R_t(p)| - |C_t(p)| = |R_t(p) - C_t(p)|$
and $|R_{t+1}(p)| - |C_{t+1}(p)| = |R_{t+1}(p) - C_{t+1}(p)|$, and so,
it follows that
\[
|R_{t+1}(p)| < |C_{t+1}(p)| + |R_t(p)| + |M^-_t(p)| - |C_t(p)|
\]
iff
\[
|R_{t+1}(p) - C_{t+1}(p)| < |R_t(p) - C_t(p)| + |M^-_t(p)|.
\]

B The lock-free deque semantics
In this section, we present the specification of the relaxed semantics associated
with the lock-free deque implementation given in [ABP01]. The
deque implements three methods:
1. pushBottom – adds an item to the bottom of the deque and does not
return a value.
2. popBottom – returns the bottom-most item from the deque, or empty,
if there is no node.
3. popTop – attempts to return the topmost item from the deque, or
empty, if there is no node. If the attempt succeeds, a node is returned.
Otherwise, the special value abort is returned.
The implementation is said to be constant-time iff any invocation of each
of the three methods takes at most a constant number of steps to return,
implying that the sequence of instructions composing the invocation has constant
length.
An invocation of one of the deque’s methods is defined by a 4-tuple
establishing: 1. the method invoked; 2. the invocation’s beginning time; 3. the
invocation’s completion time; and 4. the return value, if it exists.
A set of invocations meets the relaxed semantics iff there is a set of
linearization times for the corresponding non-aborting invocations for which:
1. every non-aborting invocation’s linearization time lies within the initiation
and completion times of the respective invocation; 2. no two linearization
times coincide; 3. the return values for each non-aborting invocation are
consistent with a serial execution of the methods in the order given by the
linearization times of the non-aborting invocations; and 4. for each aborted
popTop invocation x to a deque d, there is another invocation removing the
topmost item from d whose linearization time falls between the beginning
and completion times of x’s invocation.
A set of invocations is said to be good iff pushBottom and popBottom
are never invoked concurrently. The deque implementation presented in
[ABP01] has been proven to satisfy the relaxed semantics on any good set of
invocations. Note that any set of invocations made during the execution of
a computation scheduled by either WS or WSS is good, as the pushBottom
and popBottom methods are exclusively invoked by the (unique) owner of
the deque. Thus, throughout the paper we simply assume that the relaxed
semantics are met.
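Condition 3 of the relaxed semantics compares return values against a serial execution of the three methods. The following minimal Python sketch shows one possible serial reference model for that comparison. It is not the lock-free implementation of [ABP01]: it is single-threaded, so popTop never returns abort, and the names `SequentialDeque`, `push_bottom`, `pop_bottom`, `pop_top` and the sentinel `EMPTY` are hypothetical choices made for illustration.

```python
from collections import deque

EMPTY = "empty"  # hypothetical stand-in for the special value `empty`


class SequentialDeque:
    """Serial reference model: left end is the top, right end is the bottom."""

    def __init__(self):
        self._items = deque()

    def push_bottom(self, item):
        # pushBottom: add an item at the bottom; returns no value.
        self._items.append(item)

    def pop_bottom(self):
        # popBottom: remove and return the bottom-most item, or EMPTY.
        return self._items.pop() if self._items else EMPTY

    def pop_top(self):
        # popTop: remove and return the topmost item, or EMPTY.
        # In a serial execution the attempt never aborts.
        return self._items.popleft() if self._items else EMPTY
```

For example, after the owner pushes 1, 2 and 3, a thief's `pop_top()` yields 1 while the owner's `pop_bottom()` yields 3, which is the serial behaviour against which linearized return values would be compared.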
C Full proofs for the results obtained in Section 4
C.1 Full proof for Lemma 4.6
First, we prove that the greater the number of processors making spread attempts,
the smaller the chance that $p$’s spread attempt succeeds.
Lemma C.1. Let $\mathit{spreads}(p, \alpha, d)$ be a function corresponding to the
expected number of nodes that $p$ spreads during any round, where the ratio of
idle processors of the round is $\alpha$ and the number of processors enabling two
nodes is $d$. Then, $\mathit{spreads}(p, \alpha, d) \geq \mathit{spreads}(p, \alpha, P(1-\alpha))$.
Proof of Lemma C.1. If $p$ targets a processor whose state flag is set to working,
then its spread attempt fails. Thus, in this case $p$ would not spread
a node, regardless of the number of processors that make a spread attempt.
However, if $p$ targets a processor whose state flag is set to idle, then its
attempt has a chance to succeed. We now consider the two possible situations:
Case $d = P(1-\alpha)$: In this case, $\mathit{spreads}(p, \alpha, d) = \mathit{spreads}(p, \alpha, P(1-\alpha))$.
Case $d \neq P(1-\alpha)$: By definition there are $P(1-\alpha)$ busy processors, implying
that $d \leq P(1-\alpha)$. Thus, for this case we conclude $d < P(1-\alpha)$. Now,
suppose $p$ targets some processor $q$ whose state flag is set to idle. Thanks
to the synchronous environment we have artificially created, and assuming
that any call to UniformlyRandomNumber takes the same number of steps,
every processor executes the CAS instruction, whose success dictates
the success of the spread attempt, at the same step (line 30 of Algorithm 2).
Finally, as a consequence of our assumptions regarding the CAS instruction
(see the first paragraph of Section 4) and since processors target donees
uniformly at random, the greater the number of spread attempts that target
$q$, the smaller the chance for $p$’s spread attempt to succeed, concluding
the proof of this lemma.

Lemma C.2. Suppose there are $B$ bins, each of which is painted either red
or green, and let $B_R$ and $B_G$ denote the initial number of red and green bins,
respectively. Additionally, let $\alpha$ denote the initial ratio of red bins, meaning
$\alpha = \frac{B_R}{B}$ and $B(1-\alpha) = B_G$. Now, suppose there are $B_R$ cubes and $B_G$ balls.
First, each cube is tossed, independently and uniformly at random, into the
bins. After tossing all the $B_R$ cubes, count the number of cubes that landed in
green bins, and, for each such cube, a red bin is painted green. After finishing
all the paintings, each of the $B_G$ balls is tossed, independently and uniformly
at random, into the bins.
Let $Y$ denote the number of bins that are still red, with at least one ball.
Then,
\[
E[Y] \geq B\alpha^2 \left(1 - e^{-(1-\alpha)}\right).
\]
Proof of Lemma C.2. Let $C_{Ghit}$ and $B_{R \mapsto G}$ be two random variables,
corresponding, respectively, to the number of cubes that land in green bins and
to the number of red bins that are painted green. Then, $B_{R \mapsto G} = C_{Ghit}$, and
thus
\begin{align*}
E[B_{R \mapsto G}] &= E[C_{Ghit}] \\
&= B\alpha(1-\alpha).
\end{align*}
Similarly to Lemma 3.6, for a bin $b_i$ let $Y_i$ be an indicator variable, defined
as
\[
Y_i = \begin{cases} 1 & \text{if at least one ball lands in } b_i; \\ 0 & \text{otherwise.} \end{cases}
\]
Thus, the probability that none of the $B_G$ balls lands in $b_i$ is
\[
P\{Y_i = 0\} = \left(1 - \frac{1}{B}\right)^{B(1-\alpha)} \leq e^{-(1-\alpha)}.
\]
Since the probability that no ball lands in $b_i$ is independent from the number
of red bins painted green (i.e. $Y_i$ is independent from $B_{R \mapsto G}$), then, for any
$m$,
\[
P\{Y_i = 0 \mid B_{R \mapsto G} = m\} = P\{Y_i = 0\}, \quad \text{and} \quad P\{Y_i = 1 \mid B_{R \mapsto G} = m\} = P\{Y_i = 1\}.
\]
It follows
\begin{align*}
E[Y_i \mid B_{R \mapsto G} = m] &= 0 \cdot P\{Y_i = 0 \mid B_{R \mapsto G} = m\} + 1 \cdot P\{Y_i = 1 \mid B_{R \mapsto G} = m\} \\
&= P\{Y_i = 1\} \\
&\geq 1 - e^{-(1-\alpha)}.
\end{align*}
Consider
\[
Y = \sum_{i=1}^{B_R - B_{R \mapsto G}} Y_i,
\]
corresponding to the number of bins that are still red with at least one ball.
By the linearity of expectation, it follows
\begin{align*}
E[Y \mid B_{R \mapsto G} = m] &= E[Y_1 + Y_2 + \ldots + Y_{B_R - m} \mid B_{R \mapsto G} = m] \\
&= \sum_{i=1}^{B_R - m} E[Y_i \mid B_{R \mapsto G} = m] \\
&\geq \sum_{i=1}^{B_R - m} \left(1 - e^{-(1-\alpha)}\right) \\
&= (B_R - m)\left(1 - e^{-(1-\alpha)}\right).
\end{align*}
To conclude this proof, by the law of total expectation it follows
\begin{align*}
E[Y] &= \sum_{m=0}^{B_R} \left(E[Y \mid B_{R \mapsto G} = m]\, P\{B_{R \mapsto G} = m\}\right) \\
&\geq \sum_{m=0}^{B_R} \left((B_R - m)\left(1 - e^{-(1-\alpha)}\right) P\{B_{R \mapsto G} = m\}\right) \\
&= \left(1 - e^{-(1-\alpha)}\right) \sum_{m=0}^{B_R} \left((B_R - m)\, P\{B_{R \mapsto G} = m\}\right) \\
&= \left(1 - e^{-(1-\alpha)}\right) \left(\sum_{m=0}^{B_R} \left(B_R\, P\{B_{R \mapsto G} = m\}\right) - \sum_{m=0}^{B_R} \left(m\, P\{B_{R \mapsto G} = m\}\right)\right) \\
&= \left(1 - e^{-(1-\alpha)}\right) \left(B_R - E[B_{R \mapsto G}]\right) \\
&= \left(1 - e^{-(1-\alpha)}\right) \left(B\alpha - B(1-\alpha)\alpha\right) \\
&= B\alpha^2 \left(1 - e^{-(1-\alpha)}\right)
\end{align*}
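The bins, cubes and balls experiment of Lemma C.2 is easy to simulate, which gives a concrete feel for how tight the bound is. The sketch below is illustrative only and is not part of the paper's argument. It assumes, without loss of generality for the count $Y$, that the red bins painted green are always the highest-numbered red bins; the function names (`lower_bound`, `exact_expectation`, `simulate`) and the parameter values in the usage note are hypothetical.

```python
import math
import random


def lower_bound(B, alpha):
    """The bound of Lemma C.2: B * alpha^2 * (1 - e^{-(1 - alpha)})."""
    return B * alpha ** 2 * (1.0 - math.exp(-(1.0 - alpha)))


def exact_expectation(B, B_R):
    """E[Y] = (B_R - E[B_{R->G}]) * P{Y_i = 1}, following the proof's steps."""
    B_G = B - B_R
    alpha = B_R / B
    expected_red_left = B_R - B * alpha * (1.0 - alpha)
    p_hit = 1.0 - (1.0 - 1.0 / B) ** B_G
    return expected_red_left * p_hit


def simulate(B, B_R, trials, seed=0):
    """Monte Carlo estimate of E[Y]; bins 0..B_R-1 start red, the rest green."""
    rng = random.Random(seed)
    B_G = B - B_R
    total = 0
    for _ in range(trials):
        # Toss the B_R cubes; each cube landing in a green bin repaints one red bin.
        cubes_in_green = sum(1 for _ in range(B_R) if rng.randrange(B) >= B_R)
        red_left = B_R - cubes_in_green  # bins 0..red_left-1 are still red
        # Toss the B_G balls and count the distinct red bins that receive one.
        hit = {b for b in (rng.randrange(B) for _ in range(B_G)) if b < red_left}
        total += len(hit)
    return total / trials
```

With $B = 20$ and $B_R = 10$, `exact_expectation(20, 10)` is slightly above `lower_bound(20, 0.5)`, and `simulate(20, 10, 20000)` should land close to the former; the only slack in the bound comes from the step $\left(1 - \frac{1}{B}\right)^{B(1-\alpha)} \leq e^{-(1-\alpha)}$.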

We now obtain lower bounds on the total number of spreads (or donations)
made to processors during the second phase of some scheduling
iteration, assuming that all busy processors make a spread attempt.
Lemma C.3. Consider any round $t$ during a computation’s execution, and
let $B_t$ be the set of processors that are busy during $t$. If $\forall p \in B_t,\ |E_t(p)| = 2$,
then $E[|\mathit{Spread}^+_t(B_t)|] \geq P\alpha_t^2 \left(1 - e^{-(1-\alpha_t)}\right)$.
Proof of Lemma C.3. We prove this result by making an analogy with Lemma C.2:
1. the number of bins $B$ corresponds to the number of processors $P$; 2. the
initial ratio of red and green bins correspond, respectively, to the ratio of
idle and busy processors during the round; 3. each cube toss corresponds to
a steal attempt; 4. each red bin that is painted green corresponds to a processor
that was idle but whose steal attempt succeeded, and thus changed
its state flag to working; and 5. each ball toss corresponds to a spread attempt.
Note that we can make this analogy because all steal attempts (and
consequent state flag updates) take place during the first phase of scheduling
iterations while all spread attempts take place during the second phase of
scheduling iterations. Thus, $E[|\mathit{Spread}^+_t(B_t)|] \geq P\alpha_t^2 \left(1 - e^{-(1-\alpha_t)}\right)$. □
Proof of Lemma 4.6. By Lemma C.1 it follows that $E[|\mathit{Spread}^+_t(p_t)|]$ is the
smallest if all busy processors enabled two nodes, and thus made a spread
attempt. By Lemma C.3, the expected number of nodes spread during
a round such that all busy processors make a spread attempt is at least
$P\alpha_t^2 \left(1 - e^{-(1-\alpha_t)}\right)$. Since, as we already noted, all processors have the same
chances to make a successful spread attempt, and because each spread attempt
may migrate at most one node, it follows that the expected number
of nodes spread by each processor that makes a spread attempt is the same.
Thus, since the expected number of nodes that $p$ spreads is the smallest
if all processors make a spread attempt, then, letting $B_t$ denote the set of
processors that are busy during $t$, it follows
\[
E[|\mathit{Spread}^+_t(p_t)|] = \frac{E[|\mathit{Spread}^+_t(B_t)|]}{P(1-\alpha_t)} \geq \frac{\alpha_t^2}{1-\alpha_t} \left(1 - e^{-(1-\alpha_t)}\right).
\]

C.2 Full proof for Lemma 4.7
Proof of Lemma 4.7. As proved in Claim A.10, $R_t(p) \cap E_t(p) = \emptyset$. To
conclude the proof of this lemma, note that, from Definitions 3.4 and 4.1,
and by the definition of Algorithm 2, we have $\mathit{Stolen}^+_t(p) \subseteq R_t(p)$ and
$\mathit{Spread}^+_t(p) \subseteq E_t(p)$, and so the lemma holds. □
C.3 Full proof for Lemma 4.8
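Claim C.4. Let
\[
v(\alpha) = \frac{-2 + e^{\alpha-1}\left(2 + \alpha\left(4 - 5\alpha + \alpha^3\right)\right)}{(\alpha-1)^3}.
\]
Then, $\forall \alpha \in [0.7375; 1[$, $v(\alpha) \geq 0$.
Proof. Let
\[
f(\alpha) = -2 + e^{\alpha-1}\left(2 + \alpha\left(4 - 5\alpha + \alpha^3\right)\right)
\]
and
\[
g(\alpha) = (\alpha-1)^3.
\]
Thus,
\[
v(\alpha) = \frac{f(\alpha)}{g(\alpha)}.
\]
Since
\[
\frac{df(\alpha)}{d\alpha} = e^{-1+\alpha}(1-\alpha)^2\left(6 + 6\alpha + \alpha^2\right)
\]
it follows that $\forall \alpha \in [0.7375; 1[$, $f(\alpha)$ is non-decreasing.
Consequently, $\forall \alpha \in [0.7375; 1[$
\begin{align*}
f(\alpha) &\leq f(1) \\
&= -2 + e^{1-1}\left(2 + 1\left(4 - 5 \cdot 1 + 1^3\right)\right) \\
&= 0
\end{align*}
Since, $\forall \alpha \in [0.7375; 1[$
\[
g(\alpha) = (-1+\alpha)^3 < 0
\]
we have that, $\forall \alpha \in [0.7375; 1[$
\[
v(\alpha) = \frac{f(\alpha)}{g(\alpha)} \geq 0,
\]
concluding the proof of the claim. □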
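Proof of Lemma 4.8. Note that
\[
1 < 1 - e^{-\alpha} + \frac{\alpha^2}{1-\alpha}\left(1 - e^{-(1-\alpha)}\right)
\]
iff
\[
0 < -e^{-\alpha} + \frac{\alpha^2}{1-\alpha}\left(1 - e^{-(1-\alpha)}\right)
\]
Let
\[
s(\alpha) = -e^{-\alpha} + \frac{\alpha^2}{1-\alpha}\left(1 - e^{-(1-\alpha)}\right)
\]
Then
\[
\frac{ds(\alpha)}{d\alpha} = -1 + e^{-\alpha} + \frac{1 + e^{-1+\alpha}\alpha\left(-2 + \alpha^2\right)}{(-1+\alpha)^2}
\]
Let $t(\alpha)$ be defined as the last term (the fraction) of $\frac{ds(\alpha)}{d\alpha}$:
\[
t(\alpha) = \frac{1 + e^{-1+\alpha}\alpha\left(-2 + \alpha^2\right)}{(-1+\alpha)^2}
\]
To prove that $t(\alpha)$ is non-decreasing $\forall \alpha \in [0.7375; 1[$, we compute its
derivative.
\[
\frac{dt(\alpha)}{d\alpha} = \frac{-2 + e^{\alpha-1}\left(2 + \alpha\left(4 - 5\alpha + \alpha^3\right)\right)}{(-1+\alpha)^3}
\]
By Claim C.4, $\forall \alpha \in [0.7375; 1[$ we have $\frac{dt(\alpha)}{d\alpha} \geq 0$, meaning that $t(\alpha)$ is
non-decreasing for that interval.
It follows that $\forall \alpha \in [0.7375; 1[$ we have
\begin{align*}
t(\alpha) &= \frac{1 + e^{-1+\alpha}\alpha\left(-2 + \alpha^2\right)}{(-1+\alpha)^2} \\
&\geq \frac{1 + e^{-1+0.7375} \cdot 0.7375 \left(-2 + 0.7375^2\right)}{(-1+0.7375)^2} \\
&> 2.5
\end{align*}
Consequently,
\begin{align*}
\frac{ds(\alpha)}{d\alpha} &= -1 + e^{-\alpha} + \frac{1 + e^{-1+\alpha}\alpha\left(-2 + \alpha^2\right)}{(-1+\alpha)^2} \\
&= -1 + e^{-\alpha} + t(\alpha) \\
&> -1 + e^{-\alpha} + 2.5 \\
&> 0
\end{align*}
Thus, $\forall \alpha \in [0.7375; 1[$, $s(\alpha)$ is strictly increasing. To conclude this proof, it
only remains to note that for that same interval we have
\begin{align*}
s(\alpha) &\geq s(0.7375) \\
&= -e^{-0.7375} + \frac{0.7375^2}{1-0.7375}\left(1 - e^{-(1-0.7375)}\right) \\
&> 0.00006 \\
&> 0
\end{align*}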

