Scheduling computations with provably low synchronization overheads by Rito, Guilherme & Paulino, Hervé
Scheduling computations with provably low synchronization
overheads
Guilherme Rito
ETH-Zurich
guilherme.teixeira@inf.ethz.ch
Hervé Paulino
FCT-UNL
herve.paulino@fct.unl.pt
Abstract
Work Stealing has been a very successful algorithm for scheduling parallel computations,
and is known to achieve high performance even for computations exhibiting fine-grained
parallelism. We present a variant of Work Stealing that provably avoids most synchronization
overheads by keeping processors’ deques entirely private by default, and only exposing
work when requested by thieves. This is the first paper that obtains bounds on the synchronization
overheads that are (essentially) independent of the total amount of work, thus
corresponding to a great improvement, in both algorithm design and theory, over state-of-the-art
Work Stealing algorithms. Consider any computation with work $T_1$ and critical-path
length $T_\infty$ executed by $P$ processors using our scheduler. Our analysis shows that the
expected execution time is $O(T_1/P + T_\infty)$, and the expected synchronization overheads incurred
during the execution are at most $O((C_{CAS} + C_{MFence}) P T_\infty)$, where $C_{CAS}$ and $C_{MFence}$
respectively denote the maximum cost of executing a Compare-And-Swap instruction and a
Memory Fence instruction.
arXiv:1810.10615v2 [cs.DS] 26 Apr 2019
1 Introduction
For some time now, Work Stealing has been one of the most popular algorithms for scheduling
multithreaded computations. In Work Stealing, each worker (usually referred to as processor) owns
a double-ended queue (deque) of threads ready to execute. This deque is locally manipulated as
a stack, similarly to a sequential execution: processors push and pop threads from the bottom
side of their deque when, respectively, a new thread is spawned and the execution of the current
thread concludes. Additionally, whenever a pop operation finds the local deque empty, the pro-
cessor becomes a thief and starts targeting other processors — called its victims — uniformly
at random, with the purpose of stealing a thread from the top of their deques.
As shown by Blumofe et al. in [10], Work Stealing is provably efficient for scheduling mul-
tithreaded computations. However, due to the concurrent nature of processors’ deques, the use
of appropriate synchronization mechanisms is required to ensure correctness [8]. Consequently,
even when processors are operating locally on their deques, they incur expensive synchronization
overheads that, in most cases, are unnecessary.
The first provably efficient Work Stealing algorithm, proposed by Blumofe et al. [10], assumed
that all steal attempts targeting each deque were serialized, and only ensured the success of at
most one such attempt per time step. The idea was materialized in Cilk [9] via a blocking
synchronization protocol named THE. Despite being extremely efficient, Frigo et al. found
that the overheads introduced by the THE protocol easily account for more than half of Cilk’s
total execution time [17]. Subsequent work mitigated part of these overheads by replacing the
THE protocol with a non-blocking one that resorts to Compare-And-Swap (CAS) and Memory
Fence (MFence) instructions [7, 11]. Later, Morrison et al. tuned Cilk by removing a single
MFence instruction (one that was executed whenever a processor tried to take work from its
deque) and found that this single Memory Fence could account for as much as 25% of the
total execution time [21]. Unfortunately, it has been proved, by Attiya et al. in [8], that it is
impossible to eliminate all synchronization (e.g. the MFence instruction mentioned above) from
the implementation of any concurrent data structure that could possibly be used as a work-queue
by a Work Stealing algorithm, while maintaining correctness. Indirectly, this result implies the
impossibility of eliminating all synchronization from Work Stealing algorithms that use any fully
concurrent data structure as processors’ work-queues.
Various proposals have been made with the goal of eliminating synchronization for local
deque accesses, by making deques partly or even entirely private [2, 15, 18, 19, 21, 24, 25]. The
elimination of synchronization for local deque accesses, however, raises a new problem. Since
synchronization is required to guarantee correctness, when a processor p spawns a thread Γ and
pushes Γ (locally) to its work-queue, Γ cannot safely be stolen from p by other processors, at
least until p issues some synchronization operation [8]. So, when should a busy processor use
synchronization to permit load balancing? The subtlety of this crucial question is evidenced
by the current state-of-the-art: there is no algorithm that provably avoids most synchronization
overheads while maintaining provably good performance. On one hand, if a processor exposes
work too eagerly, then it still incurs unnecessary synchronization overheads [2, 14, 15, 19, 24, 25].
On the other hand, if a processor barely exposes any work then load balancing opportunities
become limited, thus potentially dropping the asymptotically optimal runtime guarantees of
Work Stealing [18, 21, 25]. To address this problem optimally, our algorithm follows a lazy
approach: (1) a processor p only uses synchronization to expose work when a thief directly asks
p for work, and (2) p only exposes a single unit of work (i.e. a single thread) each time it is
asked to expose work.
1.1 Related work
Many efforts have been carried out towards reducing and even eliminating the expensive syn-
chronization present in state-of-the-art Work Stealing schedulers.
Michael et al. propose a variant of Work Stealing for idempotent computations that reduces
synchronization overheads by relaxing the semantics of work-queues from the conventional
exactly-once semantics to at-least-once semantics [20]. By using work-queues that satisfy only
the weaker semantics, processors no longer have to incur expensive synchronization overheads
when operating locally. Unfortunately, this approach (of relaxing the semantics of work-queues
to at-least-once semantics) is not only inherently limited, as it is only suitable for idempotent
computations, but also drops the provably good performance guarantees of Work Stealing.
Endo et al. were the first to use split queues to avoid unnecessary synchronization [16]. In
this study, the authors present an implementation of a scalable garbage collector system that,
by using clever load balancing techniques and split queues to avoid unnecessary synchronization
overheads, achieves high performance even on large scale machines. In [14, 15], Dinan et al.
study Work Stealing in a distributed environment and propose the use of SpDeques to
avoid synchronization for local deque accesses. Lifflander et al. study the execution of iterative
over-decomposed applications [19] and propose, among others, a message-based retentive Work
Stealing algorithm adapted for the execution of iterative workloads in large scale distributed
settings. Tzannes et al. propose a scheduling algorithm where each processor keeps all of its
work entirely private, except for the topmost node that is kept stored in a shared cell [24]. Since
the algorithm always ensures that the topmost node is shared, it does not behave appropriately
for computations in which processors frequently access the topmost nodes of their deques. As
mentioned in [2], a similar limitation has been identified for the Chase-Lev Deque [13]. Unfor-
tunately, in all these approaches (i.e. in [14, 15, 19, 24]), processors expose work too eagerly,
always leaving some work exposed for thieves to take. A consequence of this design choice is
that synchronization overheads still scale with the total amount of work.
Hiraishi et al. suggest that processors should behave as in a sequential execution [18]. Under
their scheme, deques are kept entirely private and processors only permit parallelism when an
idle processor requests work. Upon such request, the busy processor backtracks to the last
point where it could have spawned a task, spawns the task, offers it to the requesting processor,
and then proceeds with the execution. Since work requests are rare, the gains of eliminating
synchronization for local operation can surpass the extra overheads arising from backtracking.
Morrison et al. study alternative designs to the synchronization protocols used by Work
Stealing schedulers, considering the architectures of modern TSO processors [21]. In their
algorithm, thieves can only steal work from a victim if such work is stored far enough from the
bottom of the victim’s deque to avoid any data race; this safe distance is computed a priori by
taking into account the size of the microprocessor’s internal store buffer. With this strategy,
not only can thieves asynchronously take work from their victims, but processors can also access
their deques locally without requiring any synchronization. Unfortunately, this scheme suffers
from a significant limitation: the bottommost threads within a processor’s deque cannot be stolen, and
thus the scheduler is not appropriate for generic computations.
More recently, Dijk et al. study the effectiveness of SpDeques in shared memory environments
[25, 26]. In their approach, however, busy processors only check for work requests each
time they access their deque, which precludes any performance guarantees for generic computations,
because the frequency at which busy processors may permit load balancing depends on
the structure of the computation. Moreover, whenever a busy processor realizes it was targeted
by a steal attempt, it exposes at least half of its work. This strategy increases the unnecessary
synchronization costs incurred by the algorithm, as busy processors now have to access
the shared part of their work-queue more often to fetch work.
Acar et al. present two Work Stealing algorithms, sender-initiated and receiver-initiated, that
avoid synchronization by making deques entirely private to each processor [2]. In addition to
promising empirical results, the authors show that the expected execution time for both algorithms
can be somewhat competitive with Work Stealing algorithms that use concurrent deques.
Unfortunately, in the sender-initiated algorithm, busy processors have to periodically
search for idle ones, leading to unnecessary communication and synchronization overheads that
still scale with the computation’s execution time, and thus, indirectly, with the total amount
of work. The difference between the receiver-initiated algorithm and ours is more subtle.
In their receiver-initiated algorithm, busy processors have to periodically check for
incoming steal requests and also expose part of their current state by means of a flag that is
periodically updated, thus requiring synchronization. This contrasts with our work, where there
is no exposed state that processors periodically have to update. This difference is reflected, for
example, in a sequential execution: while our algorithm essentially does not use synchronization,
the receiver-initiated algorithm does.
1.2 Contributions
In this paper we present Low Cost Work Stealing, a variant of the Work Stealing algorithm
that uses SpDeques to provably avoid most synchronization overheads, while maintaining an
asymptotically optimal expected runtime. The theoretical significance of our contributions is
highlighted, for instance, by the tight bounds we obtain on the synchronization overheads incurred
by our algorithm. Our bounds are essentially independent of the computation’s total
amount of work, thus contrasting with previous work. From an algorithm design perspective,
Low Cost Work Stealing greatly improves over prior Work Stealing schedulers as it shows how to
optimally use synchronization to permit provably efficient load balancing. Four of the distinctive
features of our algorithm are:
1. Busy processors only expose work to be stolen after being targeted by one or more steal
attempts. This allows processors to work locally on their work-queue without requiring
any synchronization, imposing it only when load balancing may be required.
2. Work exposure requests are attended in constant time. This is crucial to keep the al-
gorithm’s execution time bounds a constant factor away from optimal. The requirement
may be achieved by periodically checking for requests or by implementing an asynchronous
notification mechanism. For the sake of simplicity, we only focus on the former.
3. Processors only expose one thread of their local work at a time, contrasting with prior
approaches. Only in this way can synchronization for local operations be eliminated when load
balancing is only sporadically required.
4. All interactions between processors are completely asynchronous, making our algorithm
viable for multiprogrammed environments.
As we will see, our analysis shows that for a $P$-processor execution of a computation with
total work $T_1$ and critical-path length (i.e. span) $T_\infty$, the expected runtime of Low Cost Work
Stealing is at most $O(T_1/P + T_\infty)$, and the expected synchronization overheads incurred by the
algorithm are at most $O((C_{CAS} + C_{MFence}) P T_\infty)$, where $C_{CAS}$ and $C_{MFence}$ respectively denote
the synchronization costs incurred by the execution of CAS and MFence instructions. These
bounds are tight and imply that for several classes of computations our algorithm reduces the use
of synchronization by an almost exponential factor when compared with prior provably efficient
Work Stealing algorithms.
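As a hedged illustration of what these bounds mean, consider a balanced fork-join computation over n elements. This family is a hypothetical example chosen here, not one analysed in the paper, and we assume for it that $T_1 = \Theta(n)$ and $T_\infty = \Theta(\log n)$. Substituting into the two bounds gives:

```latex
\begin{align*}
\text{expected runtime} &= O\!\left(\frac{T_1}{P} + T_\infty\right)
  = O\!\left(\frac{n}{P} + \log n\right), \\
\text{expected synchronization} &= O\big((C_{CAS} + C_{MFence})\, P\, T_\infty\big)
  = O\big((C_{CAS} + C_{MFence})\, P \log n\big).
\end{align*}
```

Under this assumption and for a fixed P, the synchronization term grows with log n, whereas any scheduler whose synchronization scales with the total amount of work would pay a cost growing with n.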
1.3 Preliminaries
As in much previous work [1, 2, 3, 4, 6, 7, 10, 22, 23], we model a computation as a dag (i.e.
a directed acyclic graph) G = (V,E), where each node v ∈ V corresponds to an instruction, and
each edge (µ1, µ2) ∈ E denotes an ordering between two instructions (meaning µ2 can only be
executed after µ1). Nodes with in-degree of 0 are referred to as roots, while nodes with out-degree
of 0 are called sinks. As Arora et al. do in [7], we make two assumptions about
the structure of computations. Let G denote a computation’s dag: (1) there exists only one root
and one sink in G; (2) the out-degree of any node within G is at most two (meaning that each
instruction can spawn at most one thread).
The total number of nodes within a dag is expressed by T1 and the length of a longest directed
path (i.e. the critical-path length) by T∞. A node is ready if all its ancestors have been executed,
implying that all the ordering constraints of E are satisfied. When a node becomes ready we say
that it was enabled; to ensure correctness, only ready nodes can be executed. The assignment of a
node µ to a processor p means that µ will be the next node p executes. Finally, a computation’s
execution can be partitioned into discrete time steps, such that at each step, every processor
executes an instruction.
[Figure 1: The Split Deque. (a) An SpDeque with no stealable nodes. (b) An SpDeque with one stealable node. Labelled elements: entries, official bottom, age, private bottom, stealable node.]
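To make the dag model concrete, the following minimal Python sketch computes the work and the critical-path length of a small example dag. The dag itself, the helper names, and the choice to count the critical-path length in nodes rather than edges are all assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical example dag: node -> list of successors (out-degree <= 2).
# It has a single root ("a") and a single sink ("f"), as the model assumes.
DAG = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}

def work(dag):
    """T1: the total number of nodes (instructions) in the dag."""
    return len(dag)

def span(dag):
    """T-infinity: number of nodes on a longest directed path (assumed convention)."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        return 1 + max((longest_from(s) for s in dag[node]), default=0)
    return max(longest_from(n) for n in dag)

def roots_and_sinks(dag):
    """Roots have in-degree 0, sinks have out-degree 0."""
    has_pred = {s for succs in dag.values() for s in succs}
    roots = [n for n in dag if n not in has_pred]
    sinks = [n for n in dag if not dag[n]]
    return roots, sinks
```

For this example, `work(DAG)` counts six nodes and `span(DAG)` follows a path such as a, c, e, f.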
2 Low Cost Work Stealing
In Low Cost Work Stealing, each processor owns a lock-free SpDeque, instead of a typical
concurrent deque. An SpDeque (illustrated in Figure 1) is simply a deque that is split into
two parts: a private part and a public part. The public part lies at the top of the SpDeque
whereas the private part corresponds to the rest of the SpDeque. To avoid synchronization for local
operations, only the owner of an SpDeque is allowed to access its private part. Furthermore, by
default busy processors operate on the private part of their SpDeque, pushing and popping ready
nodes as necessary. In fact, a busy processor only attempts to fetch work from the public part
of its SpDeque if the private part is empty. In that situation, if the processor’s attempt succeeds
(i.e. if the public part of the processor’s SpDeque is not empty), the obtained node becomes the
processor’s new assigned node. However, if the public part of the processor’s SpDeque is also
empty, the processor becomes a thief and begins a stealing phase. During stealing phases, thieves
target victims uniformly at random and attempt to steal work from the top of their SpDeques.
To keep the private part of SpDeques entirely private, steal attempts are only allowed to access
the public part. Thus, when a thief attempts to steal work from a victim’s SpDeque whose public
part is empty (illustrated in Figure 1a), the steal attempt simply fails and the thief does not
obtain work. In that case, the thief updates the victim’s flag (referred to as the targeted
flag) to (asynchronously) notify the victim that the public part of its SpDeque is empty (more
on this ahead). When the owner of the SpDeque realizes it was notified (by checking the value
of its targeted flag), it tries to transfer a node from the private part of its SpDeque to the public
part. If the private part is not empty then a node is transferred, in which case we say that the
transferred node became stealable (illustrated in Figure 1b).
2.1 The Lock-Free Split Deque
We now present the specification of an SpDeque object, along with its associated relaxed semantics.
Since the behavior of SpDeques is similar to that of concurrent deques, the SpDeque’s
relaxed semantics are comparable to the relaxed deque semantics presented in [7]. An SpDeque
object meeting the relaxed semantics supports five methods:
push — Pushes a node into the bottom of the SpDeque’s private part.
pop — Removes and returns a node from the bottom of the SpDeque’s private part, if that part
is not empty. Otherwise, returns the special value race.
updateBottom — Transfers the topmost node from the private part of the SpDeque into the
bottom of the public part, and does not return a value. The invocation has no effect if the
private part of the SpDeque is empty.
popBottom—Removes and returns the bottom-most node from the public part of the SpDeque.
If the SpDeque is empty, the invocation has no effect and empty is returned.
popTop — Attempts to remove and return the topmost node from the public part of the
SpDeque. If the public part is empty, the invocation has no effect and the value empty is
returned. If the invocation aborts, it has no effect and the value abort is returned.
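The five methods can be summarised by a purely sequential reference model. The Python sketch below is an illustration only, not the lock-free implementation: the class name, the two-list representation and the snake_case method names are assumptions, and because nothing runs concurrently the abort outcome never occurs. The model's `pop_bottom` only consults the public part, on the assumption that it is called after `pop` has returned race.

```python
RACE, EMPTY = "race", "empty"  # special return values named in the text

class SpDequeModel:
    """Sequential model: index 0 of each list is the top, the end is the bottom."""

    def __init__(self):
        self.public = []   # stealable nodes (top of the SpDeque)
        self.private = []  # owner-only nodes (rest of the SpDeque)

    def push(self, node):
        # Push onto the bottom of the private part.
        self.private.append(node)

    def pop(self):
        # Pop from the bottom of the private part, or report race if it is empty.
        return self.private.pop() if self.private else RACE

    def update_bottom(self):
        # Move the topmost private node to the bottom of the public part.
        if self.private:
            self.public.append(self.private.pop(0))

    def pop_bottom(self):
        # Take the bottom-most public node, if any.
        return self.public.pop() if self.public else EMPTY

    def pop_top(self):
        # Take the topmost public node, if any (this is what a thief would call).
        return self.public.pop(0) if self.public else EMPTY
```

For example, after pushing a, b and c, a `pop_top` returns empty until `update_bottom` has exposed a, which is the oldest node.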
An SpDeque implementation is constant-time iff any invocation to each of these methods
takes at most a constant number of steps to return. Say that a set of invocations to an SpDeque’s
methods meets the relaxed semantics iff there is a set of linearization times for the corresponding
non-aborting invocations such that:
1. Every non-aborting invocation’s linearization time lies within the beginning and completion
times of the respective invocation;
2. No linearization times associated with distinct non-aborting invocations coincide;
3. The return values for the non-aborting invocations are consistent with a serial execution of
the methods in the order given by the linearization times of the corresponding non-aborting
invocations; and
4. For each aborted popTop invocation x to an SpDeque d, there exists another invocation
removing the topmost item from d whose linearization time falls between the beginning
and completion times of invocation x.
2.2 The Low Cost Work Stealing Algorithm
Algorithm 1 depicts the specification of the Low Cost Work Stealing algorithm. Each processor
owns an SpDeque that it uses to store its attached nodes and, additionally, owns a targeted flag that
stores a Boolean value. This flag is used to implement an asynchronous notification mechanism
that allows thieves to request their victims to expose work, allowing it to be stolen. Even
though, in practice, the notification mechanism of our algorithm can be implemented using signals,
a correct analysis of the algorithm’s synchronization overheads requires all possible sources
of such overheads to be explicit, which is why we chose to embed a simple notification
mechanism into the algorithm’s specification. Although the targeted flag of each processor can
be simultaneously accessed by multiple processors, to ensure the algorithm’s correctness it suffices
to guarantee that no read or write operation on a processor’s targeted flag is cached.
Algorithm 1 The Low Cost Work Stealing algorithm.
 1: procedure Scheduler
 2:   while computation not terminated do
 3:     if self.targeted then
 4:       self.spdeque.updateBottom()
 5:       self.targeted ← false
 6:     end if
 7:     if ValidNode(assigned) then
 8:       enabled ← execute(assigned)
 9:       if length(enabled) > 0 then
10:         assigned ← enabled[0]
11:         if length(enabled) = 2 then
12:           self.spdeque.push(enabled[1])
13:         end if
14:       else
15:         assigned ← self.spdeque.pop()
16:         if assigned = race then
17:           assigned ← self.spdeque.popBottom()
18:         end if
19:       end if
20:     else
21:       self.WorkMigration()
22:     end if
23:   end while
24: end procedure
25: procedure WorkMigration
26:   victim ← UniformlyRandomProcessor()
27:   assigned ← victim.spdeque.popTop()
28:   if assigned = empty then
29:     victim.targeted ← true
30:   end if
31: end procedure
32: function ValidNode(node)
33:   return node ≠ empty and node ≠ abort and node ≠ none
34: end function
Before a computation’s execution begins, every processor sets its assigned node to none
and its targeted flag to false. To start the execution, one of the processors gets the root node
assigned.
As we will see, the behavior of Low Cost Work Stealing is similar to the original Work Stealing
algorithm. Consider some processor p working on a computation scheduled by Low Cost Work
Stealing, and some iteration of the scheduling loop that p executes (corresponding to lines 2 to
23 of Algorithm 1). First, p reads the value of its targeted flag to check if it has been notified
by some thief. If p’s targeted flag is set to true (i.e. if p was notified), the processor tries to
make a node stealable by invoking updateBottom on its SpDeque. After that, and regardless of
that invocation’s outcome, p resets its targeted flag back to false. The subsequent behavior of
p depends on whether it has an assigned node.
• If p has an assigned node, p executes the node. From this execution, either zero, one or
  two nodes can be enabled.
    Zero nodes enabled: The processor tries to fetch the bottommost node stored in its
      SpDeque. To that end, p first tries to fetch a node from the bottom of its SpDeque’s
      private part (line 15). If p finds that part empty, it then tries to fetch a node from
      the public part (line 17). If this part is also empty, p becomes a thief and starts a
      work stealing phase. On the other hand, if p successfully fetched a node from any of
      the parts of its SpDeque, then the returned node becomes p’s new assigned node.
    One node enabled: The enabled node becomes p’s new assigned node (line 10).
    Two nodes enabled: One of the enabled nodes becomes p’s new assigned node, whilst
      the other is pushed into the bottom of the private part of p’s SpDeque (line 12).
• If p does not have an assigned node, it is searching for work. In this situation, the processor
first targets, uniformly at random, a victim processor and then attempts to steal work from
the public part of the victim’s SpDeque (lines 26 and 27). If the attempt is successful, the
stolen node becomes p’s new assigned node. If the attempt aborts, p simply gives up on
the steal attempt. Finally, if p finds the public part of the victim’s SpDeque empty, it sets
the victim’s targeted flag to true (line 29), notifying the victim that it found the public
part of the victim’s SpDeque empty.
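The scheduling loop described above can be exercised with a small round-robin simulation. The Python sketch below is an illustration under stated assumptions rather than the authors' implementation: processors take turns one scheduling iteration at a time (so steal attempts never abort), each SpDeque is modelled as two plain lists, and the example dag, the `Proc` class and the `run` function are hypothetical names introduced here.

```python
import random

NONE, EMPTY = "none", "empty"  # "no assigned node" markers, as in Algorithm 1

# Hypothetical example dag: node -> successors (out-degree <= 2, one root, one sink).
DAG = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"], "e": ["f"], "f": []}

class Proc:
    def __init__(self):
        self.private, self.public = [], []  # index 0 is the top, the end is the bottom
        self.targeted = False
        self.assigned = NONE

def run(dag, root, num_procs, seed=0):
    """Simulate the scheduler; returns (rounds, execution order, nodes exposed)."""
    rng = random.Random(seed)
    preds_left = {n: 0 for n in dag}
    for succs in dag.values():
        for s in succs:
            preds_left[s] += 1
    procs = [Proc() for _ in range(num_procs)]
    procs[0].assigned = root
    executed, exposures, rounds = [], 0, 0
    while len(executed) < len(dag):
        rounds += 1
        for p in procs:
            if p.targeted:                          # lines 3-6: react to a notification
                if p.private:
                    p.public.append(p.private.pop(0))  # updateBottom
                    exposures += 1
                p.targeted = False
            if p.assigned not in (NONE, EMPTY):     # busy iteration (lines 8-19)
                executed.append(p.assigned)
                enabled = []
                for s in dag[p.assigned]:
                    preds_left[s] -= 1
                    if preds_left[s] == 0:
                        enabled.append(s)
                if enabled:
                    p.assigned = enabled[0]
                    if len(enabled) == 2:
                        p.private.append(enabled[1])   # push
                elif p.private:
                    p.assigned = p.private.pop()       # pop
                elif p.public:
                    p.assigned = p.public.pop()        # popBottom
                else:
                    p.assigned = EMPTY
            else:                                   # idle iteration (lines 26-30)
                victim = rng.choice(procs)
                if victim.public:
                    p.assigned = victim.public.pop(0)  # popTop
                else:
                    victim.targeted = True
    return rounds, executed, exposures
```

With a single processor, `run(DAG, "a", 1)` executes the nodes in depth-first order and never exposes a node, which matches the observation that a sequential execution essentially needs no synchronization.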
2.3 A Split Deque Implementation
Algorithm 2 depicts a possible implementation of the lock-free SpDeque, based on the deque’s
implementation given in [7]. As illustrated in Figure 1, each SpDeque object has four instance
variables:
entries — an array of ready nodes.
privateBottom — the index below the bottommost node of the SpDeque.
officialBottom — the index below the bottommost node of the SpDeque’s public part.
age— composed of two fields: top, which corresponds to the index right below the topmost node
of the SpDeque’s public part; and tag, which is only used to ensure correctness (avoiding
the well-known ABA problem).
We say that a set of invocations is good if and only if the methods push, pop, updateBottom
and popBottom are never invoked concurrently. For Low Cost Work Stealing, as only the owner
of each SpDeque can invoke these methods, it is easy to deduce that all sets of invocations
issued by the algorithm are good. Furthermore, we claim that the implementation depicted in
Algorithm 2 is constant-time and meets the relaxed semantics (defined in Section 2.1) on any
good set of invocations. However, even though all methods are composed of a small number of
instructions and none includes a loop, proving this claim is not a straightforward task because
all possible execution interleavings have to be considered. Moreover, as the main focus of this
study is not program verification, the proof of this claim falls outside the scope of
this paper. Yet, we remark that the proposed implementation is a simple extension of the deque
implementation presented in [7], which has been proven in [12] to be a correct implementation,
meeting the relaxed deque semantics on any set of invocations made by the Work Stealing
algorithm. For this reason, throughout this paper we assume that for any set of invocations
issued by the Low Cost Work Stealing algorithm, the relaxed semantics is always satisfied.
Lemma 2.1. No invocation to push requires an MFence instruction.
Proof. Since the push method operates only once over a single publicly accessible field (entries)
of the SpDeque’s state, no MFence instructions are required.
Lemma 2.2. No invocation to pop requires an MFence instruction.
Proof. Any invocation to the pop method only reads from two publicly accessible fields of the
SpDeque’s state, namely officialBottom (line 8) and entries (line 11). However, due to a data
dependency, no re-ordering between these read operations may occur, and so, no MFence
instructions are required.
Algorithm 2 The SpDeque implementation
privateBottom ← 0 // private field
entries ← {} // private read-write, public read-only
officialBottom ← 0 // private read-write, public read-only
age ← {0, 0} // public field
 1: procedure push(node)
 2:   pBot ← self.privateBottom
 3:   self.entries[pBot] ← node
 4:   self.privateBottom ← pBot + 1
 5: end procedure
 6: procedure pop
 7:   pBot ← self.privateBottom
 8:   if pBot = self.officialBottom then return race
 9:   end if
10:   pBot ← pBot − 1
11:   node ← self.entries[pBot]
12:   self.privateBottom ← pBot
13:   return node
14: end procedure
15: procedure popTop
16:   oldAge ← self.age
17:   oldBottom ← self.officialBottom
18:   if oldBottom ≤ oldAge.top then return empty
19:   end if
20:   node ← self.entries[oldAge.top]
21:   newAge ← oldAge
22:   newAge.top ← newAge.top + 1
23:   if CAS(self.age, oldAge, newAge) = success then
24:     return node
25:   end if
26:   return abort
27: end procedure
28: procedure updateBottom
29:   pBot ← self.privateBottom
30:   oBot ← self.officialBottom
31:   if pBot > oBot then oBot ← oBot + 1
32:   end if
33:   self.officialBottom ← oBot
34: end procedure
35: procedure popBottom
36:   oBot ← self.officialBottom
37:   if oBot = 0 then return empty
38:   end if
39:   oBot ← oBot − 1
40:   self.officialBottom ← oBot
41:   node ← self.entries[oBot]
42:   oldAge ← self.age
43:   if oBot > oldAge.top then return node
44:   end if
45:   self.officialBottom ← 0
46:   self.privateBottom ← 0
47:   newAge.top ← 0
48:   newAge.tag ← oldAge.tag + 1
49:   if oBot = oldAge.top then
50:     if CAS(self.age, oldAge, newAge) = success then
51:       return node
52:     end if
53:   end if
54:   self.age ← newAge
55:   return empty
56: end procedure
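For readers who prefer executable code, the following Python sketch transliterates Algorithm 2. It is only an approximation under stated assumptions: Python has no hardware Compare-And-Swap, so the hypothetical helper `_cas_age` emulates one with a lock, the fixed `capacity` is an assumption, and the sketch says nothing about memory fences or instruction reordering.

```python
import threading
from collections import namedtuple

RACE, EMPTY, ABORT = "race", "empty", "abort"
Age = namedtuple("Age", ["top", "tag"])

class SpDeque:
    def __init__(self, capacity=1024):        # capacity is an illustrative assumption
        self.entries = [None] * capacity
        self.private_bottom = 0
        self.official_bottom = 0
        self.age = Age(0, 0)
        self._cas_lock = threading.Lock()      # stands in for a hardware CAS

    def _cas_age(self, old, new):
        # Emulated Compare-And-Swap on the age field.
        with self._cas_lock:
            if self.age == old:
                self.age = new
                return True
            return False

    def push(self, node):
        self.entries[self.private_bottom] = node
        self.private_bottom += 1

    def pop(self):
        if self.private_bottom == self.official_bottom:
            return RACE                        # private part is empty
        self.private_bottom -= 1
        return self.entries[self.private_bottom]

    def update_bottom(self):
        # Expose at most one node by growing the public part downwards.
        if self.private_bottom > self.official_bottom:
            self.official_bottom += 1

    def pop_top(self):
        old_age = self.age
        if self.official_bottom <= old_age.top:
            return EMPTY
        node = self.entries[old_age.top]
        if self._cas_age(old_age, Age(old_age.top + 1, old_age.tag)):
            return node
        return ABORT

    def pop_bottom(self):
        if self.official_bottom == 0:
            return EMPTY
        self.official_bottom -= 1
        o_bot = self.official_bottom
        node = self.entries[o_bot]
        old_age = self.age
        if o_bot > old_age.top:
            return node
        # At most one public node was left: reset the indices and bump the tag.
        self.official_bottom = 0
        self.private_bottom = 0
        new_age = Age(0, old_age.tag + 1)
        if o_bot == old_age.top and self._cas_age(old_age, new_age):
            return node
        self.age = new_age
        return EMPTY
```

The owner calls `push`, `pop`, `update_bottom` and `pop_bottom`, while thieves only call `pop_top`, mirroring the notion of a good set of invocations.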
The dag of a computation is dynamically unfolded during its execution. If the execution of
a node u enables another node u′, then (u, u′) is an enabling edge and we refer to node u as the
designated parent of u′. We call the tree formed by the enabling edges of a particular execution
of a dag the enabling tree, and denote the depth of a node u within this tree by d(u). We define
the weight of u as w(u) = T∞ − d(u). As in [7], our analysis is made in an a posteriori
fashion, allowing us to refer to the enabling tree generated by a computation’s execution.
The following corollary is a direct consequence of the standard properties of deques (a full
proof can be found in the appendix (Lemma 5.1)).
Corollary 2.3. Let v1, . . . , vk denote the nodes stored in some processor p’s SpDeque, ordered
from the bottom of the SpDeque to the top, at some moment during the execution of Low Cost
Work Stealing. Moreover, let v0 denote p’s assigned node (if any). Then, we have w (v0) ≤
w (v1) < . . . < w (vk−1) < w (vk).
3 Analysis
In this section we obtain bounds on the expected execution time of computations using Low Cost
Work Stealing, and on the expected synchronization overheads incurred by the scheduler. The
analysis we make follows the same overall idea as the one given in [7]. Due to space restrictions,
it is not possible to include all the proofs in the paper (which are thus presented in the appendix).
Before advancing any further, we introduce a few more essential definitions.
Define a scheduling iteration as a sequence of instructions executed by a processor corre-
sponding to a particular iteration of the scheduling loop (lines 2 to 23 of Algorithm 1). Thus,
the full sequence of instructions executed by each processor during a computation’s execution
can be partitioned into scheduling iterations. As in [7], we introduce the concept of milestone:
an instruction within the sequence executed by a processor is a milestone iff it corresponds to
a node’s execution (line 8) or to the return of a call to WorkMigration (line 31). Taking into
account the definition of a scheduling iteration it is clear that any scheduling iteration of the
algorithm includes a milestone. Refer to iterations whose milestone corresponds to a node’s
execution as busy iterations, and refer to the remainder as idle iterations. As one might note, if
a processor has an assigned node at the beginning of an iteration’s execution, the iteration is a
busy one, and, otherwise, the iteration is an idle one. By observing the scheduling loop (lines 2
to 23 of Algorithm 1), and taking into account that the SpDeque’s implementation is constant
time, it is clear that any scheduling iteration is composed of a constant number of instructions. It
then follows that any processor executes at most a constant number of instructions between two
consecutive milestones. Throughout the analysis, let C denote a constant that is large enough
to guarantee that any sequence of instructions executed by a processor with length at least C
includes a milestone.
We can now bound the execution time of a computation depending on the number of idle
iterations that take place during that computation’s execution. The proof of the following re-
sult can be found in the appendix (Lemma 5.3), and is a trivial variant of [7, Lemma 5], but
considering the Low Cost Work Stealing algorithm.
Lemma 3.1. Consider any computation with work $T_1$ being executed by $P$ processors, under
Low Cost Work Stealing. The execution time is $O(T_1/P + I/P)$, where $I$ denotes the number of idle
iterations executed by processors.
As we will see, the following two results are key, as they show that the synchronization
overheads incurred by Low Cost Work Stealing (essentially) only depend on the number of idle
iterations that take place during a computation’s execution (proofs in appendix (Lemmas 5.4
and 5.5, respectively)).
Lemma 3.2. Consider a processor p executing a busy iteration such that p’s targeted flag is set
to false when p checks it at the beginning of the iteration. If the execution of p’s assigned node
enables one or more nodes, or, if the private part of p’s SpDeque is not empty, then, no MFence
instruction is required during the execution of the iteration.
Lemma 3.3. Consider any computation being executed by the Low Cost Work Stealing algo-
rithm, using P processors. The number of CAS and MFence instructions executed by processors
during the computation’s execution is at most O (I + P ), where I denotes the total number of
idle iterations executed by processors.
3.1 Bounds on the expected number of idle iterations
The rest of the analysis focuses on bounding the number of idle iterations that take place during
a computation’s execution, and follows the same general arguments as the analysis presented
in [7].
We say that a node u is stealable if u is stored in the public part of some processor’s SpDeque.
Furthermore, we denote the set of ready nodes at some step i by Ri. Consider any node u ∈ Ri.
The potential associated with u at step i is denoted by $\phi_i(u)$ and is defined as
$$\phi_i(u) = \begin{cases} 4^{3w(u)-2} & \text{if } u \text{ is assigned} \\ 4^{3w(u)-1} & \text{if } u \text{ is stealable} \\ 4^{3w(u)} & \text{otherwise.} \end{cases}$$
The total potential at step i, denoted by $\Phi_i$, corresponds to the sum of potentials of all the
nodes that are ready at that step: $\Phi_i = \sum_{u \in R_i} \phi_i(u)$.
The following lemma is a formalization of the arguments already given in [7], but considering
the potential function we present (proof in appendix (Lemma 5.6)).
Lemma 3.4. Consider some node u, ready at step i during the execution of a computation.
1. If u gets assigned to a processor at that step, the potential drops by at least $\frac{3}{4}\phi_i(u)$.
2. If u becomes stealable at that step, the potential drops by at least $\frac{3}{4}\phi_i(u)$.
3. If u was already assigned to a processor and gets executed at that step i, the potential drops
by at least $\frac{47}{64}\phi_i(u)$.
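The drop factors of Lemma 3.4 can be checked mechanically from the definition of the potential. The following is a minimal sketch and not part of the original analysis: the names `potential` and `drop_fraction` are hypothetical, and the execution case assumes that each enabled child has weight w(u) − 1, with one child becoming assigned and the other being neither assigned nor stealable.

```python
from fractions import Fraction

# Exponent offsets taken from the potential function defined above.
OFFSET = {"assigned": -2, "stealable": -1, "other": 0}

def potential(weight, state):
    """Potential 4^(3*w + offset) of a ready node of weight w in the given state."""
    return Fraction(4) ** (3 * weight + OFFSET[state])

def drop_fraction(before, after_nodes):
    """Fraction of `before` that is lost when it is replaced by `after_nodes`."""
    return (before - sum(after_nodes)) / before

w = 5  # illustrative weight; the ratios do not depend on it

# Item 1, smallest drop: a stealable node becomes assigned.
assign_drop = drop_fraction(potential(w, "stealable"), [potential(w, "assigned")])
# Item 2: a node that is neither assigned nor stealable becomes stealable.
expose_drop = drop_fraction(potential(w, "other"), [potential(w, "stealable")])
# Item 3 (assumed worst case): executing u enables two children of weight w - 1.
execute_drop = drop_fraction(
    potential(w, "assigned"),
    [potential(w - 1, "assigned"), potential(w - 1, "other")],
)
print(assign_drop, expose_drop, execute_drop)  # 3/4 3/4 47/64
```

Exact rational arithmetic is used so that the comparison with the stated fractions is not affected by rounding.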
For the remainder of the analysis, we make use of a few more definitions, first introduced
in [7]. We denote the set of ready nodes attached to some processor p (i.e. the ready nodes in p’s
SpDeque together with the node it has assigned, if any) at the beginning of some step i by Ri (p).
Furthermore, we define the total potential associated with p at step i as the sum of the potentials
of each of the nodes that is attached to p at the beginning of that step: $\Phi_i(p) = \sum_{u \in R_i(p)} \phi_i(u)$.
For each step i, we partition the processors into two sets, Di and Ai, where the first is the set
of all processors whose SpDeque is not empty at the beginning of step i while the second is the
set of all other processors (i.e. the set of all processors whose SpDeque is empty at the beginning
of that step). Thus, the potential at any step i, $\Phi_i$, is composed of the potential associated
with each of these two partitions: $\Phi_i = \Phi_i(D_i) + \Phi_i(A_i)$, where $\Phi_i(D_i) = \sum_{p \in D_i} \Phi_i(p)$ and
$\Phi_i(A_i) = \sum_{p \in A_i} \Phi_i(p)$.
The following lemma is a direct consequence of Corollary 2.3 and of the potential function’s
properties (proof in appendix (Lemma 5.7)).
Lemma 3.5. Consider any step i and any processor $p \in D_i$. The top-most node u in p’s SpDeque
contributes at least $\frac{4}{5}$ of the potential associated with p. That is, we have $\phi_i(u) \geq \frac{4}{5}\Phi_i(p)$.
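To see where the constant comes from, consider one concrete configuration (a worked instance only; the formal proof is in the appendix). Suppose p’s SpDeque holds a single node u, which is stealable, and p’s assigned node v_0 has the same weight as u, which the weight ordering w(v_0) ≤ w(v_1) < . . . (stated as Corollary 5.2 in the appendix) permits. In that configuration the bound holds with equality:

```latex
\Phi_i(p) = \phi_i(u) + \phi_i(v_0)
          = 4^{3w(u)-1} + 4^{3w(u)-2}
          = \tfrac{5}{4}\cdot 4^{3w(u)-1}
\quad\Longrightarrow\quad
\phi_i(u) = \tfrac{4}{5}\,\Phi_i(p).
```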
With this, we now show that if a processor p is targeted by a steal attempt, then p’s potential
decreases by a constant factor (proof in appendix (Lemma 5.8)).
Lemma 3.6. Suppose a thief processor p chooses a processor $q \in D_i$ as its victim at some step
j, such that j ≥ i (i.e. a steal attempt of p targeting q occurs at step j). Then, at step j + 2C,
the potential has decreased by at least $\frac{3}{5}\Phi_i(q)$ due to either the assignment of the topmost node in q’s
SpDeque, or the topmost node of q’s SpDeque becoming stealable.
The next lemma is a trivial generalization of the original result presented in [7, Balls and
Weighted Bins] (proof in appendix (Lemma 5.9)).
Lemma 3.7 (Balls and Weighted Bins). Suppose we are given at least B balls, and exactly B
bins. Each of the balls is tossed independently and uniformly at random into one of the B bins,
where for i = 1, . . . , B, bin i has a weight $W_i$. The total weight is $W = \sum_{i=1}^{B} W_i$. For each bin
i, we define the random variable $X_i$ as
$$X_i = \begin{cases} W_i & \text{if some ball lands in bin } i \\ 0 & \text{otherwise} \end{cases}$$
and define the random variable X as $X = \sum_{i=1}^{B} X_i$. Then, for any β in the range 0 < β < 1, we
have $P\{X \geq \beta W\} \geq 1 - \frac{1}{(1-\beta)e}$.
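The statement can be sanity-checked empirically. The sketch below is a Monte Carlo estimate for one illustrative configuration (ten bins of equal weight, ten balls, β = 1/2); the function names, the number of trials and the seed are our own choices, and the experiment illustrates the lemma rather than proving it.

```python
import math
import random

def lemma_bound(beta):
    """Lower bound 1 - 1/((1 - beta) e) stated in the lemma."""
    return 1.0 - 1.0 / ((1.0 - beta) * math.e)

def estimate_probability(weights, num_balls, beta, trials=20_000, seed=0):
    """Monte Carlo estimate of P{X >= beta * W} for the given bin weights."""
    rng = random.Random(seed)
    num_bins = len(weights)
    threshold = beta * sum(weights)
    hits = 0
    for _ in range(trials):
        # Set of bins that received at least one ball in this trial.
        occupied = {rng.randrange(num_bins) for _ in range(num_balls)}
        if sum(weights[b] for b in occupied) >= threshold:
            hits += 1
    return hits / trials

weights = [1.0] * 10  # B = 10 bins of equal weight (illustrative choice)
estimate = estimate_probability(weights, num_balls=10, beta=0.5)
print(f"estimate={estimate:.3f}  bound={lemma_bound(0.5):.3f}")
```

For this configuration the estimate should lie well above the bound, since the bound has to hold for every assignment of weights, including far less favourable ones.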
The following result states that for each P idle iterations that take place, with constant prob-
ability, the total potential drops by a constant factor. The result is a consequence of Lemmas 3.6
and 3.7 (proof in appendix (Lemma 5.10)).
Lemma 3.8. Consider any step i and any later step j such that at least P idle iterations occur
from i (inclusive) to j (exclusive). Then, we have $P\left\{\Phi_i - \Phi_{j+2C} \geq \frac{3}{10}\Phi_i(D_i)\right\} > \frac{1}{4}$.
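The constants in Lemma 3.8 are consistent with combining the previous lemmas as follows. This is a sketch of the arithmetic only, under the assumption that Lemma 3.7 is applied with β = 1/2 and with one bin per processor, where a processor q in D_i has weight equal to three fifths of its potential and every other processor has weight 0, mirroring the argument of [7]. The factor 3/5 itself is the product of the 3/4 drop of Lemma 3.4 and the 4/5 share of Lemma 3.5.

```latex
\frac{3}{4}\cdot\frac{4}{5} = \frac{3}{5},
\qquad
\beta W = \frac{1}{2}\cdot\frac{3}{5}\,\Phi_i(D_i) = \frac{3}{10}\,\Phi_i(D_i),
\qquad
1 - \frac{1}{\left(1-\frac{1}{2}\right)e} = 1 - \frac{2}{e} \approx 0.264 > \frac{1}{4}.
```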
Following Lemma 3.8, we are able to bound the expected number of idle iterations that take
place during a computation’s execution using the Low Cost Work Stealing algorithm (proof in
appendix (Lemma 5.11)).
Lemma 3.9. Consider any computation with work $T_1$ and critical-path length $T_\infty$ being executed
by Low Cost Work Stealing using P processors. The expected number of idle iterations is at
most $O(P T_\infty)$, and with probability at least 1 − ε, the number of idle iterations is at most
$O\left(\left(T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right) P\right)$.
Figure 2: Comparison of Low Cost Work Stealing (LCWS) with Classical Work Stealing (CWS).
Each panel plots the number of synchronization operations on a logarithmic vertical axis.
(a) 64 processors, number of synchronization operations vs. span, regular dags. Series: CWS (CAS + MFence), LCWS (CAS + MFence), LCWS (Notifs).
(b) 64 processors, number of synchronization operations vs. span, irregular dags. Series: CWS (CAS + MFence), LCWS (CAS + MFence), LCWS (Notifs).
(c) Number of synchronization operations vs. number of processors executing the computation, regular dags. Series: CWS (CAS + MFence), LCWS (CAS + MFence + Notifs).
Finally, using Lemma 3.9, we can obtain bounds on both expected runtime of computa-
tions executed by the Low Cost Work Stealing algorithm, and the associated synchronization
overheads.
Theorem 3.10. Consider any computation with work $T_1$ and critical-path length $T_\infty$ being executed
by the Low Cost Work Stealing algorithm with P processors. The expected execution
time is at most $O\left(\frac{T_1}{P} + T_\infty\right)$, and with probability at least 1 − ε, the execution time is at most
$O\left(\frac{T_1}{P} + T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right)$. Moreover, the expected number of CAS and MFence instructions executed
during the computation’s execution caused by Low Cost Work Stealing is at most $O(P T_\infty)$, and
with probability at least 1 − ε the number of CAS and MFence instructions executed is at most
$O\left(P\left(T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right)\right)$.
Proof. Both results follow directly from Lemmas 3.1, 3.3 and 3.9.
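Spelled out, the substitution behind the expected-value bounds is the following. This is a sketch: it uses linearity of expectation and, for the final step, the assumption that T∞ ≥ 1, so that the additive P term is absorbed. Here T denotes the execution time and S the number of CAS and MFence instructions executed; both symbols are introduced only for this illustration.

```latex
\mathbb{E}[T] = O\left(\frac{T_1}{P} + \frac{\mathbb{E}[I]}{P}\right)
             = O\left(\frac{T_1}{P} + \frac{P\,T_\infty}{P}\right)
             = O\left(\frac{T_1}{P} + T_\infty\right),
\qquad
\mathbb{E}[S] = O\left(\mathbb{E}[I] + P\right)
             = O\left(P\,T_\infty + P\right)
             = O\left(P\,T_\infty\right).
```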
Corollary 3.11. Consider the statement of Theorem 3.10. Furthermore, let $C_{CAS}$ and $C_{MFence}$
denote, respectively, the maximum synchronization overheads incurred by the execution of a CAS
instruction and of an MFence instruction. The expected synchronization overheads incurred by Low Cost Work
Stealing are at most
$$O\left((C_{CAS} + C_{MFence})\, P\, T_\infty\right),$$
and with probability at least 1 − ε the synchronization overheads incurred by Low Cost Work
Stealing are at most $O\left((C_{CAS} + C_{MFence})\, P \left(T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right)\right)$.
Proof. As already mentioned, and by considering the definition of Low Cost Work Stealing,
depicted in Algorithm 1, the only synchronization mechanisms the scheduler uses are CAS and
MFence instructions. This corollary is then a direct consequence of Theorem 3.10 that takes into
account the maximum possible overhead incurred by the execution of a single CAS or MFence
instruction.
4 Comparison with Work Stealing
To get a better understanding of the importance of avoiding synchronization for local deque
accesses, we now compare the synchronization costs of our algorithm against conventional Work
Stealing algorithms that use concurrent deques. To that end, we developed a simulator that,
given a computation’s dag, executes it, monitoring not only the number of CAS and MFence
instructions executed but also the number of times that thieves requested other processors to expose
work. In this section we use the term notification to refer to when a thief sets another processor’s
targeted flag to true, requesting it to expose work.
For this comparison, we consider two distinct classes of dags: regular and irregular. Regular
dags essentially correspond to trees of instructions where every non-leaf instruction forks
two other instructions, and whose depth is given by an argument that is passed to the simulator.
Irregular dags are intended to simulate unbalanced computations. To that end, we use
the argument passed to the simulator as the total depth of the dag, and make the depth between
each two consecutive fork instructions follow an exponential distribution with parameter
λ = 0.05. The first class of dags corresponds to computations with balanced parallelism (e.g. Fibonacci,
Parallel-For, etc.) whilst the second corresponds to the ones with unbalanced parallelism
(e.g. Graph Searches).
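The simulator’s dag generator is not reproduced here. The following is only a rough sketch of the two shapes described above, under assumptions of our own: the root of a regular dag sits at depth 0, and for irregular dags only the gaps between consecutive forks along a single root-to-leaf path are sampled. The function names are hypothetical.

```python
import random

FORK_GAP_RATE = 0.05  # the parameter lambda quoted in the text

def regular_dag_size(depth):
    """Number of instructions in a complete binary fork tree (root at depth 0)."""
    return 2 ** (depth + 1) - 1

def sample_fork_gaps(total_depth, rng, rate=FORK_GAP_RATE):
    """Sample gaps between consecutive forks along one path, within total_depth."""
    gaps, used = [], 0.0
    while True:
        gap = rng.expovariate(rate)  # exponential with mean 1 / rate = 20
        if used + gap > total_depth:
            return gaps
        gaps.append(gap)
        used += gap

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
print(regular_dag_size(20))
print(len(sample_fork_gaps(600, rng)))
```

With a mean gap of 20 between forks, a path of total depth in the hundreds contains only a few dozen forks, which is one way to read why the irregular dags of Figure 2b use much larger spans than the regular ones.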
From Figure 2a, it is clear that while for Work Stealing with concurrent deques the number of
synchronization operations grows linearly with the total amount of work (and thus exponentially
increases with the span of the dag), for Low Cost Work Stealing the number of synchronization
operations and notifications scales linearly with the span of the computation. Thus, even if
the costs of handling notifications (i.e. of exposing work) were a thousand times greater than
the cost of executing CAS or MFence instructions, for dags with fork-span of at least ≈ 20,
our algorithm would incur lower synchronization overheads. In practice, computations with a
fork-span ≥ 20 are very common, especially among computations exhibiting fine-grained parallelism. Unfortunately, due
to the limitations that come with using a simulator, we have not been able to benchmark dags
with a fork-span greater than 25. Yet, we remark that the trend is clear and confirms that
the use of SpDeques makes it possible to avoid most of the synchronization that is present in conventional
Work Stealing algorithms. Figure 2b reinforces our insight, showing that even for computations
exhibiting irregular parallelism, Low Cost Work Stealing is able to avoid most of the synchro-
nization present in Work Stealing algorithms that use concurrent deques. Finally, Figure 2c
shows that, while the synchronization costs for Work Stealing are always extremely high, even
for single processor executions, for Low Cost Work Stealing these costs only grow linearly with
the number of processors used and, for a single processor execution, synchronization is negligible.
From a more theoretical perspective, note that, by taking into account our assumptions
(which are standard [1, 2, 3, 4, 6, 7, 10, 22, 23]) regarding computations’ structure, we can create
computations for which $T_1 = O\left(2^{T_\infty}\right)$ (which correspond to dags of the first class). Since for
such computations the number of deque accesses is directly proportional to the total amount of
work ($T_1$), our result shows that the use of SpDeques reduces by almost an exponential
factor the synchronization present in conventional Work Stealing algorithms.
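As a rough illustration that ignores all constant factors (and is therefore not a prediction of measured counts), take a span of 20 and P = 64, the processor count used in Figures 2a and 2b:

```latex
T_1 \approx 2^{T_\infty} = 2^{20} = 1\,048\,576,
\qquad
P\,T_\infty = 64 \cdot 20 = 1280,
\qquad
\frac{2^{T_\infty}}{P\,T_\infty} \approx 819.
```

Each additional unit of span doubles the first quantity while adding only P to the second, which is the sense in which the gap is (almost) exponential.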
5 Conclusion
In this paper we studied a Work Stealing algorithm that uses SpDeques to reduce synchronization
overheads. Whereas traditional Work Stealing algorithms require synchronization every time
processors access their deques, in our proposal, synchronization operations are employed optimally,
which is the key to eliminating most unnecessary synchronization overheads. By default, busy
processors operate locally on their deques without any synchronization, resembling a sequential
execution. Idle processors can request busy ones to expose some of their work, thus allowing
for load balancing via direct steals. This lazy approach for using synchronization is the key for
guaranteeing an asymptotically optimal expected runtime while provably reducing synchroniza-
tion overheads. Indeed, we proved that the expected total synchronization of the algorithm is
$O\left(P T_\infty (C_{CAS} + C_{MFence})\right)$. To justify the tightness of our bounds, we recall that, for Low Cost
Work Stealing, the expected number of (successful and unsuccessful) steal attempts is O (PT∞).
By noting that the public part of an SpDeque is essentially a concurrent deque, and, by taking
into account the impossibility of eliminating all synchronization from the implementation of con-
current deques while maintaining their correctness (see [8]), we conclude that the synchronization
bounds we have obtained for Low Cost Work Stealing are tight.
Finally, and as already discussed in Section 4, for several types of computations, the syn-
chronization overheads of conventional Work Stealing algorithms grow linearly with both the
total amount of work and the number of steal attempts. For numerous classes of parallel com-
putations, the total amount of work increases exponentially with the span of the computation
(i.e. $T_1 = O\left(2^{T_\infty}\right)$). From this perspective, our results make evident the significance of the
improvement of Low Cost Work Stealing over prior Work Stealing algorithms: not only are the
synchronization overheads incurred by our algorithm (essentially) exponentially smaller than pre-
vious algorithms, but our algorithm also maintains the asymptotically optimal expected runtime
bounds of the concurrent deque Work Stealing algorithm [7].
References
[1] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing.
Theory Comput. Syst., 35(3):321–347, 2002.
[2] Umut A. Acar, Arthur Charguéraud, and Mike Rainey. Scheduling parallel programs by work
stealing with private deques. In ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, PPoPP ’13, Shenzhen, China, February 23-27, 2013, pages 219–228,
2013.
[3] Kunal Agrawal, Yuxiong He, Wen-Jing Hsu, and Charles E. Leiserson. Adaptive scheduling
with parallelism feedback. In 21th International Parallel and Distributed Processing Sympo-
sium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA, pages
1–7, 2007.
[4] Kunal Agrawal, Charles E. Leiserson, Yuxiong He, and Wen-Jing Hsu. Adaptive work-
stealing with parallelism feedback. ACM Trans. Comput. Syst., 26(3), 2008.
[5] Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, 1992.
[6] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multipro-
grammed multiprocessors. In SPAA, pages 119–129, 1998.
[7] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multipro-
grammed multiprocessors. Theory Comput. Syst., 34(2):115–144, 2001.
[8] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and
Martin T. Vechev. Laws of order: expensive synchronization in concurrent algorithms
cannot be eliminated. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28,
2011, pages 487–498, 2011.
[9] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson,
Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. J.
Parallel Distrib. Comput., 37(1):55–69, 1996.
[10] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by
work stealing. J. ACM, 46(5):720–748, 1999.
[11] Robert D Blumofe and Dionisios Papadopoulos. The performance of work stealing in mul-
tiprogrammed environments. In ACM SIGMETRICS Performance Evaluation Review, vol-
ume 26, pages 266–267. ACM, 1998.
[12] Robert D Blumofe, C Greg Plaxton, and Sandip Ray. Verification of a concurrent deque
implementation. University of Texas at Austin, Austin, TX, 1999.
[13] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In SPAA 2005: Proceed-
ings of the 17th Annual ACM Symposium on Parallelism in Algorithms and Architectures,
July 18-20, 2005, Las Vegas, Nevada, USA, pages 21–28, 2005.
[14] James Dinan, Sriram Krishnamoorthy, D. Brian Larkins, Jarek Nieplocha, and P. Sadayap-
pan. Scioto: A framework for global-view task parallelism. In 2008 International Conference
on Parallel Processing, ICPP 2008, September 8-12, 2008, Portland, Oregon, USA, pages
586–593, 2008.
[15] James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek
Nieplocha. Scalable work stealing. In Proceedings of the ACM/IEEE Conference on High
Performance Computing, SC 2009, November 14-20, 2009, Portland, Oregon, USA, 2009.
[16] Toshio Endo, Kenjiro Taura, and Akinori Yonezawa. A scalable mark-sweep garbage collec-
tor on large-scale shared-memory machines. In Proceedings of the ACM/IEEE Conference
on Supercomputing, SC 1997, November 15-21, 1997, San Jose, CA, USA, page 48, 1997.
[17] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the
cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN ’98 Conference on
Programming Language Design and Implementation (PLDI), Montreal, Canada, June 17-
19, 1998, pages 212–223. ACM, 1998.
[18] Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi Yuasa. Backtracking-based
load balancing. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, PPOPP 2009, Raleigh, NC, USA, February 14-18, 2009,
pages 55–64, 2009.
[19] Jonathan Lifflander, Sriram Krishnamoorthy, and Laxmikant V. Kalé. Work stealing
and persistence-based load balancers for iterative overdecomposed applications. In The
21st International Symposium on High-Performance Parallel and Distributed Computing,
HPDC’12, Delft, Netherlands - June 18 - 22, 2012, pages 137–148, 2012.
[20] Maged M. Michael, Martin T. Vechev, and Vijay A. Saraswat. Idempotent work stealing. In
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPOPP 2009, Raleigh, NC, USA, February 14-18, 2009, pages 45–54, 2009.
[21] Adam Morrison and Yehuda Afek. Fence-free work stealing on bounded TSO processors.
In Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14,
Salt Lake City, UT, USA, March 1-5, 2014, pages 413–426, 2014.
[22] Stefan K. Muller and Umut A. Acar. Latency-hiding work stealing: Scheduling interacting
parallel computations with work stealing. In Proceedings of the 28th ACM Symposium on
Parallelism in Algorithms and Architectures, SPAA 2016, Asilomar State Beach/Pacific
Grove, CA, USA, July 11-13, 2016, pages 71–82, 2016.
[23] Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis Roch, and Julien Bernard.
A tighter analysis of work stealing. In Algorithms and Computation - 21st International
Symposium, ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Proceedings, Part II,
pages 291–302, 2010.
[24] Alexandros Tzannes. Enhancing productivity and performance portability of general-
purpose parallel programming. 2012.
[25] Tom van Dijk and Jaco C. van de Pol. Lace: Non-blocking split deque for work-stealing. In
Euro-Par 2014: Parallel Processing Workshops - Euro-Par 2014 International Workshops,
Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II, pages 206–217, 2014.
[26] Thijs van Ede. Certainty in lockless concurrent algorithms: an informal proof of lace. 2015.
Appendix: Proofs
Due to space constraints, most of the proofs of our claims are placed in this appendix. Never-
theless, if this paper is accepted, we will publish a full version of this paper, with proofs, in a
freely accessible on-line repository.
The following lemma is crucial for the performance analysis of Low Cost Work Stealing. An
analogous result has already been proved for concurrent deques (see [7, Lemma 3]). For the sake
of completeness, we present its proof, which is a simple transcription of the original proof of [7, Lemma
3], adapted for SpDeques.
Lemma 5.1 (Structural Lemma for SpDeques). Let $v_1, \ldots, v_k$ denote the nodes stored in some
processor p’s SpDeque, ordered from the bottom of the SpDeque to the top, at some point in the
linearized execution of Low Cost Work Stealing. Moreover, let $v_0$ denote p’s assigned node (if
any), and for $i = 0, \ldots, k$ let $u_i$ denote the designated parent of $v_i$ in the enabling tree. Then, for
$i = 1, \ldots, k$, $u_i$ is an ancestor of $u_{i-1}$ in the enabling tree, and although $v_0$ and $v_1$ may have the
same designated parent (i.e. $u_0 = u_1$), for $i = 2, 3, \ldots, k$, $u_{i-1} \neq u_i$ (i.e. the ancestor relationship
is proper).
Proof. Fix a particular SpDeque. The SpDeque state and assigned node only change when the
assigned node is executed or a thief performs a successful steal. We prove the claim by induction
on the number of assigned-node executions and steals since the SpDeque was last empty. In the
base case, if the SpDeque is empty, then the claim holds vacuously. We now assume that the
claim holds before a given assigned-node execution or successful steal, and we will show that
it holds after. Specifically, before the assigned-node execution or successful steal, let v0 denote
the assigned node; let k denote the number of nodes in the SpDeque; let v1, . . . , vk denote the
nodes in the SpDeque ordered from the bottom to top; and for i = 0, . . . , k, let ui denote the
designated parent of vi. We assume that either k = 0, or for i = 1, . . . , k, node ui is an ancestor
of ui−1 in the enabling tree, with the ancestor relationship being proper, except possibly for the
case i = 1. After the assigned-node execution or successful steal, let $v_0'$ denote the assigned
node; let $k'$ denote the number of nodes in the SpDeque; let $v_1', \ldots, v_{k'}'$ denote the nodes in the
SpDeque ordered from bottom to top; and for $i = 1, \ldots, k'$, let $u_i'$ denote the designated parent
of $v_i'$. We now show that either $k' = 0$, or for $i = 1, \ldots, k'$, node $u_i'$ is an ancestor of $u_{i-1}'$ in the
enabling tree, with the ancestor relationship being proper, except possibly for the case i = 1.
Consider the execution of the assigned node v0 by the owner.
If the execution of v0 enables 0 children, then the owner pops the bottommost node off its
SpDeque and makes that node its new assigned node. If k = 0, then the SpDeque is empty; the
owner does not get a new assigned node; and k′ = 0. If k > 0, then the bottommost node v1 is
popped and becomes the new assigned node, and k′ = k − 1. If k = 1, then k′ = 0. Otherwise,
$k' = k - 1$. We now rename the nodes as follows. For $i = 0, \ldots, k'$, we set $v_i' = v_{i+1}$ and
$u_i' = u_{i+1}$. We now observe that for $i = 1, \ldots, k'$, node $u_i'$ is a proper ancestor of $u_{i-1}'$ in the
enabling tree.
If the execution of v0 enables 1 child x, then x becomes the new assigned node; the designated
parent of x is v0; and k′ = k. If k = 0, then k′ = 0. Otherwise, we can rename the nodes as
follows. We set v0′ = x; we set u0′ = v0; and for i = 1, . . . , k′, we set vi′ = vi and ui′ = ui. We
now observe that for i = 1, . . . , k′, node ui′ is a proper ancestor of ui−1′ in the enabling tree.
That u1′ is a proper ancestor of u0′ in the enabling tree follows from the fact that (u0, v0) is an
enabling edge.
In the most interesting case, the execution of the assigned node v0 enables 2 children x and y,
with x being pushed onto the bottom of the SpDeque and y becoming the new assigned node. In
this case, (v0, x) and (v0, y) are both enabling edges, and k′ = k + 1. We now rename the nodes
as follows. We set v0′ = y; we set u0′ = v0; we set v1′ = x; we set u1′ = v0; and for i = 2, . . . , k′,
we set $v_i' = v_{i-1}$ and $u_i' = u_{i-1}$. We now observe that $u_1' = u_0'$, and for $i = 2, \ldots, k'$, node
$u_i'$ is a proper ancestor of $u_{i-1}'$ in the enabling tree. That $u_2'$ is a proper ancestor of $u_1'$ in the
enabling tree follows from the fact that $(u_0, v_0)$ is an enabling edge.
Finally, we consider a successful steal by a thief. In this case, the thief pops the topmost node
vk off the SpDeque, so k′ = k − 1. If k = 1, then k′ = 0. Otherwise, we can rename the nodes
as follows. For i = 0, . . . , k′, we set vi′ = vi and ui′ = ui. We now observe that for i = 1, . . . , k′,
node ui′ is an ancestor of ui−1′ in the enabling tree, with the ancestor relationship being proper,
except possibly for the case i = 1.
Corollary 5.2. If v0, v1, . . . , vk are as defined in the statement of Lemma 5.1, then we have
w (v0) ≤ w (v1) < . . . < w (vk−1) < w (vk).
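Corollary 5.2 can be read as an invariant that a debugging build of a scheduler could assert. A minimal sketch, with a hypothetical helper name and with node weights passed in directly as integers:

```python
def satisfies_weight_order(assigned_weight, deque_weights):
    """Check w(v0) <= w(v1) < ... < w(vk).

    deque_weights lists w(v1), ..., w(vk) from the bottom of the SpDeque to the
    top; assigned_weight is w(v0), or None when there is no assigned node.
    """
    # Weights must be strictly increasing from bottom to top inside the SpDeque.
    strictly_increasing = all(
        lower < upper for lower, upper in zip(deque_weights, deque_weights[1:])
    )
    if assigned_weight is None or not deque_weights:
        return strictly_increasing
    # The assigned node may tie with the bottommost node but must not exceed it.
    return strictly_increasing and assigned_weight <= deque_weights[0]

print(satisfies_weight_order(3, [3, 4, 7]))  # True
print(satisfies_weight_order(4, [3, 4, 7]))  # False
```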
We are now able to bound the execution time of a computation depending on the number
of idle iterations that take place during that computation’s execution. The following result is a
trivial variant of [7, Lemma 5] but considering the Low Cost Work Stealing algorithm, and is
only added for the sake of completeness.
Lemma 5.3. Consider any computation with work $T_1$ being executed by $P$ processors, under
Low Cost Work Stealing. The execution time is $O\left(\frac{T_1}{P} + \frac{I}{P}\right)$, where $I$ denotes the number of idle
iterations executed by processors.
Proof. Consider two buckets to which we add tokens during the computation’s execution: the
busy bucket and the idle bucket. At the end of each iteration, every processor places a token
into one of these buckets. If a processor executed a node during the iteration, it places a token
into the busy bucket, and otherwise, it places a token into the idle bucket. Since we have P
processors, for each C consecutive steps, at least P tokens are placed into the buckets.
Because, by definition, the computation has T1 nodes, there will be exactly T1 tokens in the
busy bucket when the computation’s execution ends. Moreover, as I denotes the number of idle
iterations, it also corresponds to the number of tokens in the idle bucket when the computation’s
execution ends. Thus, exactly T1 + I tokens are collected during the computation’s execution.
Taking into account that for each C consecutive steps at least P tokens are placed into the
buckets, we conclude the number of steps required to collect all the tokens is at most $C \cdot \left(\frac{T_1}{P} + \frac{I}{P}\right)$.
After collecting all the $T_1$ tokens, the computation’s execution terminates, implying the execution
time is at most $O\left(\frac{T_1}{P} + \frac{I}{P}\right)$.
Lemma 5.4. Consider a processor p executing a busy iteration such that p’s targeted flag is set
to false when p checks it at the beginning of the iteration. If the execution of p’s assigned node
enables one or more nodes, or, if the private part of p’s SpDeque is not empty, then, no MFence
instruction is required during the execution of the iteration.
Proof. Consider the Low Cost Work Stealing algorithm, depicted in Algorithm 1. The first
action p takes for the execution of that iteration is checking the value of its targeted flag (line
3). Since, by the statement of this lemma p’s targeted flag is set to false at the moment when
p checks the flag’s value, p does not enter the then branch of the if statement. Moreover, as a
consequence of the conditional statement of line 3, there is a control dependency that does not
allow the instructions succeeding the conditional expression to be reordered with the evaluation
of the condition, implying no MFence instruction is required until this point.
After that, p checks if it has an assigned node. Again, since the next action p takes depends
on its currently assigned node, there is a control dependency from the instruction where p checks
if it has a currently assigned node to both branches of the if statement. Thus, no instruction
reordering between the evaluation of the condition and any of the instructions succeeding that
evaluation can be made, implying no MFence instruction is required until here.
Because we assumed p was executing a busy iteration, by the definition of a busy iteration, p
must have an assigned node. Hence, p executes its assigned node. Since the next action p takes
depends on the outcome of that node’s execution, there is a control dependency between the
execution of p’s assigned node and the execution of the sequence of instructions corresponding
to each of the possible outcomes. Hence, no instruction reordering can be made, implying no
MFence instruction is necessary until this point.
From that node’s execution, three outcomes are possible:
0 nodes enabled In this case, p invokes the pop method on its own SpDeque. By Lemma 2.2,
the invocation does not require the execution of a MFence instruction. Furthermore, the
next instruction (line 16) has a data dependency on the value of p’s assigned node, for
which reason it cannot be reordered with the invocation of the pop method and so no
MFence instruction is required.
Since we have assumed that the private part of p’s SpDeque was not empty, it is trivial
to conclude that the pop invocation returns a node, which immediately becomes p’s new
assigned node. Thus, after having a new node assigned p takes no further action during the
iteration, meaning no MFence instruction was required for the execution of the iteration
in this situation.
1 node enabled In this case the enabled node becomes p’s new assigned node. Next, p checks
the number of nodes that were enabled. The assignment of one of the enabled nodes and
the instruction where p checks the number of nodes enabled can be reordered. Fortunately,
because enabled is a local variable that is solely accessed by p, there is no harm for a parallel
execution if the instructions are reordered and so no MFence instruction is required for this
case as well. Because p enabled a single node it takes no further action during the iteration,
implying the lemma holds in this situation as well.
2 nodes enabled Finally, for this case one of the enabled nodes becomes p’s new assigned node.
Using the same reasoning as for the case where a single node was enabled, we conclude that
no MFence instruction is required at least until the evaluation of the conditional statement.
Because p enabled two nodes, p enters the conditional expression and pushes the node it
did not assign into the bottom of its SpDeque, by invoking the push method. Since there
is a control dependency between the execution of this instruction and the evaluation of the
condition, no instruction reordering is allowed. Thus, no MFence instruction is required
between these two instructions.
Finally, Lemma 2.1 states that an invocation to the push method does not require a MFence
instruction to be executed. Because after the invocation p takes no further action during
the iteration, we deduce the lemma holds, concluding its proof.
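The case analysis above can be mirrored in a small sequential toy model. This is a sketch under strong assumptions and not the scheduler of Algorithm 1: the method names push, pop and popBottom come from the text, whereas `ToySpDeque`, `busy_iteration`, the `sync_ops` counter and the choice of which enabled child becomes the assigned node are hypothetical. Being sequential, the model says nothing about memory ordering; it only shows which path of a busy iteration reaches the synchronized popBottom method when the targeted flag is false.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ToySpDeque:
    """Sequential toy model of a deque with a private and a public part."""
    private: deque = field(default_factory=deque)  # right end is the bottom
    public: deque = field(default_factory=deque)   # nodes exposed to thieves
    sync_ops: int = 0                              # synchronized accesses so far

    def push(self, node):
        self.private.append(node)                  # private part: no synchronization

    def pop(self):
        return self.private.pop() if self.private else None

    def pop_bottom(self):
        self.sync_ops += 1                         # public part: synchronized access
        return self.public.pop() if self.public else None

def busy_iteration(assigned, spdeque, execute):
    """One busy iteration with the targeted flag assumed false.

    `execute` maps a node to the list of nodes it enables (0, 1 or 2 entries).
    Returns the new assigned node, or None if no node could be obtained.
    """
    enabled = execute(assigned)
    if enabled:
        if len(enabled) == 2:
            spdeque.push(enabled[1])               # the extra child stays private
        return enabled[0]
    node = spdeque.pop()                           # try the private part first
    if node is None:
        node = spdeque.pop_bottom()                # only now is synchronization used
    return node

dq = ToySpDeque()
a = busy_iteration("root", dq, lambda n: ["left", "right"])  # 2 nodes enabled
b = busy_iteration(a, dq, lambda n: [])                      # 0 enabled, private non-empty
print(a, b, dq.sync_ops)  # left right 0
c = busy_iteration(b, dq, lambda n: [])                      # 0 enabled, private empty
print(c, dq.sync_ops)     # None 1
```

The first two iterations satisfy the hypothesis of the lemma and never touch the synchronized path; only the third, which falls outside that hypothesis, does.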
The following lemma is a consequence of Lemma 3.2 (corresponding to Lemma 5.4 of the
appendix) and states that the number of CAS and MFence instructions executed during a com-
putation’s execution using Low Cost Work Stealing only depends on the number of idle iterations
and processors.
Lemma 5.5. Consider any computation being executed by the Low Cost Work Stealing algo-
rithm, using P processors. The number of CAS and MFence instructions executed by processors
during the computation’s execution is at most O (I + P ), where I denotes the total number of
idle iterations executed by processors.
Proof. By observing Algorithms 1 and 2, it is easy to see that only invocations to popBottom
or popTop methods can lead to the execution of CAS instructions. Furthermore, both these
methods are invoked at most once per scheduling iteration, and, for both, at most one CAS
instruction is executed per invocation. Since processors only invoke the popTop method when
executing idle iterations, the number of CAS instructions caused by invocations to popTop is
O (I). On the other hand, processors only invoke the popBottom method during busy iterations
where the private part of their SpDeque is empty and the execution of their currently assigned
node does not enable any new nodes. Let p denote some processor executing one such iteration.
From p’s invocation to the popBottom method two outcomes are possible:
A node is returned In this case the public part of p’s SpDeque was not empty implying p had
previously transferred a node from the private part of its SpDeque to the public part. By
observing Algorithm 1 it is easy to deduce that p only makes these node transfers if some
thief had previously set p’s targeted flag to true. Moreover, because after transferring the
node p immediately sets its targeted flag back to false, the number of times p makes such
node transfers is at most the number of times it is targeted by a steal attempt. Taking into
account that processors only make steal attempts during the execution of idle iterations,
and make exactly one steal attempt for each such iteration, exactly I steal attempts take
place during a computation’s execution. As such, the number of CAS instructions executed
in situations like this one is at most O (I).
Empty is returned In this case p will not have an assigned node at the end of the scheduling
iteration’s execution. Thus, after p finishes executing the iteration two scenarios may occur:
p executes an idle iteration For this case we can create a mapping from idle iterations
to each busy iteration that precedes an idle iteration, implying there can be at most
O (I) such iterations. With this, it is trivial to conclude that the number of CAS
instructions executed by Low Cost Work Stealing for situations equivalent to this one
is at most O (I).
The computation’s execution terminates Because there are exactly P processors, at
most P scheduling iterations can precede the end of a computation’s execution. Con-
sequently, the number of CAS instructions executed for scenarios equivalent to this
one is at most O (P ).
Summing up all the possible scenarios, the number of CAS instructions executed by Low
Cost Work Stealing is at most O (I + P ).
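As a rough tally of the four cases just discussed, the bound can be written out as follows. The explicit constants are only read off from the case analysis under the assumption that each O (I) term hides a constant of one; the lemma itself claims nothing beyond the asymptotic bound.

```latex
\begin{align*}
\#\mathrm{CAS}
  &\le \underbrace{I}_{\text{popTop, one per idle iteration}}
   + \underbrace{I}_{\text{popBottom returns a node}}
   + \underbrace{I}_{\text{popBottom returns empty, idle iteration follows}}
   + \underbrace{P}_{\text{popBottom returns empty, execution ends}} \\
  &= 3I + P = O(I + P).
\end{align*}
```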
We now turn to the number of MFence instructions executed during a computation’s execu-
tion. To that end, we first bound the number of scheduling iterations that can contain MFence
instructions. Consider any scheduling iteration s during a computation’s execution, and let p
denote the processor that executed the iteration. Iteration s was either an idle or a busy iteration.
s is an idle iteration. By definition, at most I iterations are idle, implying there are O (I) such
iterations that could contain MFence instructions.
s is a busy iteration. When p checks its targeted flag, one of the two following situations arises:
targeted is true. By observing Algorithm 1 we conclude that such a situation can only
occur if another processor q has set p’s targeted flag to true, which can only occur if q
was executing an idle iteration. After executing the conditional statement, p resets
its targeted flag back to false. Thus, the total number of busy iterations in which a
processor finds its targeted flag set to true is at most I, because each such iteration can be
mapped to a distinct idle iteration in which the flag was set. Consequently, the number of
iterations similar to this one is at most O (I).
targeted is false. As p is executing a busy iteration, it will execute its assigned node.
From that node’s execution, either 0, 1 or 2 other nodes can be enabled.
More than 0 nodes are enabled. Lemma 3.2 (corresponding to Lemma 5.4 of the
appendix) implies that no MFence instruction is executed in this case.
0 nodes are enabled. In this case, p cannot immediately assign a new node, because
it did not enable any. By Algorithm 1, p will then invoke the pop method on its
own SpDeque. With this, one of two possible situations arises:
SpDeque’s private part is not empty. As a consequence of Lemma 3.2 (corresponding
to Lemma 5.4 of the appendix), no MFence instruction is executed in this case.
SpDeque’s private part is empty. In this case, by observing Algorithm 2 we
conclude that the invocation returns the special value race, implying p will
make an invocation to popBottom still during that same iteration. From that
invocation, two outcomes are possible:
A node is returned. In this situation, p assigns the node. By observing
Algorithm 1 it is trivial to conclude that this scenario only arises if some
processor previously set p’s targeted flag to true. As a consequence, p
transferred a node from the private part of its SpDeque to the public part.
Again, using the same reasoning as for the case where p’s targeted flag is
set to true, we conclude the number of such iterations is at most O (I).
Empty is returned. After p finishes executing the current scheduling iteration s,
two scenarios may occur:
p executes an idle iteration. Each iteration satisfying the same conditions as s
can be mapped to the idle iteration that p executes immediately after it.
Thus, there can be at most O (I) such iterations.
The computation’s execution terminates. Since there are exactly P
processors, at most P scheduling iterations can immediately precede the end of a
computation’s execution. Consequently, there are at most P scheduling
iterations similar to s.
Now, we sum up all the scheduling iterations that may contain MFence instructions.
Accounting for all possible scenarios, it follows that at most O (I + P ) scheduling iterations may
contain MFence instructions. Since any scheduling iteration is composed of at most C
instructions, where C is a constant, at most C MFence instructions can be executed per iteration,
implying the number of MFence instructions executed during a computation’s execution is at most
O (C (I + P )) = O (I + P ).
The following lemma is a formalization of the arguments already given in [7], but considering
the potential function we present.
Lemma 5.6. Consider some node u, ready at step i during the execution of a computation.
1. If u gets assigned to a processor at that step, the potential drops by at least $\frac{3}{4}\phi_i(u)$.
2. If u becomes stealable at that step, the potential drops by at least $\frac{3}{4}\phi_i(u)$.
3. If u was already assigned to a processor and gets executed at that step i, the potential drops
by at least $\frac{47}{64}\phi_i(u)$.
Proof. Regarding the first claim, if u was stealable the potential decreases from $4^{3w(u)-1}$ to
$4^{3w(u)-2}$. Otherwise, the potential decreases from $4^{3w(u)}$ to $4^{3w(u)-2}$, which is even more than in
the previous case. Given that $4^{3w(u)-1} - 4^{3w(u)-2} = \frac{3}{4}\phi_i(u)$, we conclude that if u gets assigned
the potential decreases by at least $\frac{3}{4}\phi_i(u)$.
Regarding the second one, note that u was not stealable (because it became stealable at step
i) and so the potential decreases from $4^{3w(u)}$ to $4^{3w(u)-1}$. So, if u becomes stealable, the potential
decreases by $4^{3w(u)} - 4^{3w(u)-1} = \frac{3}{4}\phi_i(u)$.
We now prove the last claim. Recall that, by our conventions regarding computations’
structure, each node within a computation’s dag can have an out-degree of at most two.
Consequently, each node can be the designated parent of at most two other ones in the enabling
tree. Moreover, by definition, the weight of any node is strictly smaller than the weight of its
designated parent, since it is deeper in the enabling tree than its designated parent. Consider
the three possible scenarios:
0 nodes enabled. The potential decreases by $\phi_i(u)$.
1 node enabled. The enabled node becomes the assigned node of the processor (that executed
u). Let x denote the enabled node. Since x is the child of u in the enabling tree, it follows
$\phi_i(u) - \phi_{i+1}(x) = 4^{3w(u)-2} - 4^{3w(x)-2} = 4^{3w(u)-2} - 4^{3(w(u)-1)-2} = \frac{63}{64}\phi_i(u)$. Thus, for this
situation, the potential decreases by $\frac{63}{64}\phi_i(u)$.
2 nodes enabled. In this case, one of the enabled nodes immediately becomes the assigned node
of the processor whilst the other is pushed onto the bottom of the SpDeque’s private part.
Let x denote the enabled node that becomes the processor’s new assigned node and y the
other enabled node. Since both x and y have u as their designated parent in the enabling
tree, we have $\phi_i(u) - \phi_{i+1}(x) - \phi_{i+1}(y) = 4^{3w(u)-2} - 4^{3w(x)-2} - 4^{3w(y)} = \frac{47}{64}\phi_i(u)$. As
such, the potential decreases by $\frac{47}{64}\phi_i(u)$, concluding the proof of the lemma.
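The fractions in Lemma 5.6 can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the potential of a ready node is $4^{3w(u)}$, $4^{3w(u)-1}$ or $4^{3w(u)-2}$ depending on whether it is neither stealable nor assigned, stealable, or assigned (as the proof above uses), and that an enabled child has weight w(u) − 1. The names `phi`, `OFFSET` and `relative_drops`, and the state labels, are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Exponent offsets inferred from the proof above; the state names are labels
# chosen for this sketch, not identifiers from the paper.
OFFSET = {"private": 0, "stealable": 1, "assigned": 2}

def phi(weight, state):
    """Potential 4^(3*weight - offset) of a ready node in the given state."""
    return Fraction(4) ** (3 * weight - OFFSET[state])

def relative_drops(w):
    """Potential drop, as a fraction of the node's potential before the event."""
    assigned = phi(w, "assigned")
    child_assigned = phi(w - 1, "assigned")  # enabled child that gets assigned
    child_private = phi(w - 1, "private")    # enabled child pushed to the private part
    return {
        "assign_from_stealable": (phi(w, "stealable") - assigned) / phi(w, "stealable"),
        "become_stealable": (phi(w, "private") - phi(w, "stealable")) / phi(w, "private"),
        "execute_0_enabled": assigned / assigned,
        "execute_1_enabled": (assigned - child_assigned) / assigned,
        "execute_2_enabled": (assigned - child_assigned - child_private) / assigned,
    }

for event, drop in relative_drops(5).items():
    print(event, drop)
```

The ratios do not depend on the chosen weight, since every term scales by the same power of 4.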
The following lemma is a direct consequence of Corollary 2.3 (corresponding to Corollary 5.2
of the appendix) and of the potential function’s properties. The result is a variant of
[7, Top-Heavy Deques], considering SpDeques instead of the conventional fully concurrent deques, and
our potential function, instead of the original.
Lemma 5.7. Consider any step i and any processor p ∈ Di. The top-most node u in p’s SpDeque
contributes at least $\frac{4}{5}$ of the potential associated with p. That is, we have $\phi_i(u) \ge \frac{4}{5}\Phi_i(p)$.
Proof. This lemma follows from Corollary 2.3 (corresponding to Corollary 5.2 of the appendix).
We prove it by induction on the number of nodes within p’s SpDeque.
Base case. As the base case, consider that p’s SpDeque contains a single node u. The processor
itself may or may not have an assigned node. If it does not, we have $\phi_i(u) = \Phi_i(p)$.
Otherwise, let x denote p’s assigned node. Corollary 2.3 implies that
$w(u) \ge w(x)$. It follows $\Phi_i(p) = \phi_i(u) + \phi_i(x) = 4^{3w(u)-1} + 4^{3w(x)-2} \le \frac{5}{4}\phi_i(u)$. Thus, if
p’s SpDeque contains a single node we have $\Phi_i(p) \le \frac{5}{4}\phi_i(u)$.
Induction step. Consider that p’s SpDeque now contains n nodes, where n ≥ 2, and let u, x
denote the topmost and second topmost nodes, respectively, within the SpDeque. For
the purpose of induction, let us assume the lemma holds for the first n − 1 nodes
(i.e. without accounting for u): $\Phi_i(p) - \phi_i(u) \le \frac{5}{4}\phi_i(x)$. Corollary 2.3 (corresponding
to Corollary 5.2 of the appendix) implies $w(u) > w(x)$, that is, $w(u) - 1 \ge w(x)$. It follows
$\Phi_i(p) \le \frac{5}{4}\phi_i(x) + \phi_i(u) = \frac{5}{4}4^{3w(x)} + 4^{3w(u)} \le \frac{5}{4}4^{3(w(u)-1)} + 4^{3w(u)} = \frac{261}{256}\phi_i(u) < \frac{5}{4}\phi_i(u)$,
concluding the proof of the lemma.
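To get a feel for why the topmost node dominates, the share it contributes can be computed exactly for small configurations. This is only a sketch under stated assumptions: node weights strictly decrease from top to bottom, the public (stealable) nodes sit at the top, the assigned node weighs no more than any node in the SpDeque, and potentials use the exponents from the proof of Lemma 5.6. The function names and the list-based representation are hypothetical.

```python
from fractions import Fraction

# Exponent offsets as used in the proof of Lemma 5.6 (labels are hypothetical).
OFFSET = {"private": 0, "stealable": 1, "assigned": 2}

def phi(weight, state):
    """Potential 4^(3*weight - offset) of a ready node in the given state."""
    return Fraction(4) ** (3 * weight - OFFSET[state])

def top_share(weights, public_count, assigned_weight=None):
    """Fraction of a processor's potential contributed by its topmost node.

    weights: node weights listed from top to bottom of the SpDeque.
    public_count: how many of the topmost nodes are stealable.
    assigned_weight: weight of the assigned node, or None if there is none.
    """
    total = Fraction(0)
    for index, weight in enumerate(weights):
        state = "stealable" if index < public_count else "private"
        total += phi(weight, state)
    if assigned_weight is not None:
        total += phi(assigned_weight, "assigned")
    top_state = "stealable" if public_count > 0 else "private"
    return phi(weights[0], top_state) / total

# Tight case: one stealable node and an assigned node of the same weight.
print(top_share([5], public_count=1, assigned_weight=5))
# A deeper SpDeque only makes the topmost node more dominant.
print(top_share([5, 4, 3], public_count=1, assigned_weight=3))
```

In this sketch the single-node configuration with an equally heavy assigned node is the one that meets the bound with equality.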
The following result is a consequence of Lemma 3.5 (corresponding to Lemma 5.7 of the
appendix).
Lemma 5.8. Suppose a thief processor p chooses a processor q ∈ Di as its victim at some step
j, such that j ≥ i (i.e. a steal attempt of p targeting q occurs at step j). Then, by step j + 2C,
the potential has decreased by at least $\frac{3}{5}\Phi_i(q)$ due to either the assignment of the topmost node in q’s
SpDeque, or that node becoming stealable.
Proof. Let u denote the topmost node of q’s SpDeque at the beginning of step i. We first prove
that u either gets assigned or becomes stealable.
Three possible scenarios may take place due to p’s steal attempt targeting q’s SpDeque.
The invocation returns a node. If p stole u, then u gets assigned to p. Otherwise, some
other processor removed u before p did, implying u got assigned to that other processor.
The invocation aborts. Since the SpDeque implementation meets the relaxed semantics on any
good set of invocations, and because the Low Cost Work Stealing algorithm only makes
good sets of invocations, we conclude that some other processor successfully removed a
topmost node from q’s SpDeque during the aborted steal attempt made by p. If the
removed node was u, u gets assigned to a processor (that may either be q or some other
thief that successfully stole u). Otherwise, u must have been previously stolen by a thief
or popped by q, and thus became assigned to some processor.
The invocation returns empty. This situation can only occur if either q’s SpDeque is
completely empty, or if there is no node in the public part of q’s SpDeque.
• For the first case, since q ∈ Di, some processor must have successfully removed u from
q’s SpDeque. Consequently, u was assigned to a processor.
• If there was no node in the public part of q’s SpDeque, p sets q’s targeted flag to
true in a later step j′. Recall that, for each C consecutive instructions executed
by a processor, at least one corresponds to a milestone. It follows that j′ ≤ j + C.
Furthermore, by observing Algorithm 1, we conclude that q will make and complete
an invocation to updateBottom of its SpDeque in one of the C steps succeeding step
j′. Thus if q’s SpDeque’s private part is not empty, a node will become stealable.
From that invocation, only two possible situations can take place:
No node becomes stealable. In this case, the private part of q’s SpDeque was
empty, implying some processor (either q or some thief) assigned u.
A node becomes stealable. If the node that became stealable as the result of the
invocation was not u, then either u was assigned by a processor (that could have
been q or some thief), or u had already been transferred to the public part of
q’s SpDeque as a consequence of another thief’s steal attempt that also returned
empty, implying that either u became assigned, or it became stealable. Otherwise,
the node that became stealable as a result of the updateBottom invocation
was u. Thus, in any case, u either gets assigned to a processor or becomes stealable.
With this, we conclude that u either became assigned or became stealable by step j + 2C.
From Lemma 3.5 (corresponding to Lemma 5.7 of the appendix), we have $\phi_i(u) \ge \frac{4}{5}\Phi_i(q)$.
Furthermore, Lemma 3.4 (corresponding to Lemma 5.6 of the appendix) proves that if u gets
assigned the potential decreases by at least $\frac{3}{4}\phi_i(u)$, and if u becomes stealable the potential also
decreases by at least $\frac{3}{4}\phi_i(u)$. Because u is either assigned or becomes stealable in any case, and
$\frac{3}{4}\phi_i(u) \ge \frac{3}{4} \cdot \frac{4}{5}\Phi_i(q) = \frac{3}{5}\Phi_i(q)$, we
conclude the potential associated with q at step j + 2C has decreased by at least $\frac{3}{5}\Phi_i(q)$.
The following lemma is a trivial generalization of the original result presented in [7, Balls
and Weighted Bins]. The only difference between the two results is the assumption of having at
least B balls, rather than exactly B balls. Its proof is only presented for the sake of completeness,
and is (trivially) adapted from the proof of [7, Balls and Weighted Bins].
Lemma 5.9 (Balls and Weighted Bins). Suppose we are given at least B balls, and exactly B
bins. Each of the balls is tossed independently and uniformly at random into one of the B bins,
where for i = 1, . . . , B, bin i has a weight $W_i$. The total weight is $W = \sum_{i=1}^{B} W_i$. For each bin
i, we define the random variable $X_i$ as
$$X_i = \begin{cases} W_i & \text{if some ball lands in bin } i \\ 0 & \text{otherwise} \end{cases}$$
and define the random variable X as $X = \sum_{i=1}^{B} X_i$.
Then, for any β in the range 0 < β < 1, we have $P\{X \ge \beta W\} \ge 1 - \frac{1}{(1-\beta)e}$.
Proof. Consider the random variable $W_i - X_i$ taking the value of $W_i$ when no ball lands in
bin i and 0 otherwise, and let B′ denote the total number of balls that are tossed. Since B′ ≥ B, it follows
$E[W_i - X_i] = W_i\left(1 - \frac{1}{B}\right)^{B'} \le \frac{W_i}{e}$. From the linearity of expectation, we have
$E[W - X] \le \frac{W}{e}$. Markov’s Inequality then implies
$P\{W - X > (1-\beta)W\} = P\{X < \beta W\} \le \frac{E[W - X]}{(1-\beta)W} \le \frac{1}{(1-\beta)e}$.
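The bound of Lemma 5.9 can be illustrated empirically. The sketch below is a plain Monte Carlo estimate, not part of the proof: the weights, the number of trials, the seed and all function names are arbitrary choices made for this illustration.

```python
import math
import random

def estimate_probability(weights, balls, beta, trials=20000, seed=0):
    """Estimate P{X >= beta * W} by tossing `balls` balls uniformly into the bins."""
    rng = random.Random(seed)
    bins = len(weights)
    total_weight = sum(weights)
    successes = 0
    for _ in range(trials):
        # Set of bins hit by at least one ball in this trial.
        hit = {rng.randrange(bins) for _ in range(balls)}
        x = sum(weights[i] for i in hit)
        if x >= beta * total_weight:
            successes += 1
    return successes / trials

def lemma_lower_bound(beta):
    """Lower bound 1 - 1/((1 - beta) e) stated by the lemma."""
    return 1 - 1 / ((1 - beta) * math.e)

# One heavy bin and seven light ones: X >= W/2 holds exactly when the heavy bin is hit.
weights = [8, 1, 1, 1, 1, 1, 1, 1]
estimate = estimate_probability(weights, balls=len(weights), beta=0.5)
bound = lemma_lower_bound(0.5)
print(f"estimate = {estimate:.3f}, lemma bound = {bound:.3f}")
```

With these weights the exact probability is 1 − (7/8)^8, so the estimate should sit well above the bound; tossing more than B balls can only increase it.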
The following result states that for each P idle iterations that take place, with constant
probability the potential drops by a constant factor. An analogous lemma was originally presented
in [7, Lemma 8] for the non-blocking Work Stealing algorithm. The result is a consequence of
Lemmas 3.7 and 3.6 (corresponding to Lemmas 5.9 and 5.8 of the appendix, respectively) and
its proof follows the same lines as the one presented in that study.
Lemma 5.10. Consider any step i and any later step j such that at least P idle iterations occur
from i (inclusive) to j (exclusive). Then, we have
$$P\left\{\Phi_i - \Phi_{j+2C} \ge \frac{3}{10}\Phi_i(D_i)\right\} > \frac{1}{4}.$$
Proof. By Lemma 3.6 (corresponding to Lemma 5.8 of the appendix) we know that for each
processor p ∈ Di that is targeted by a steal attempt, the potential drops by at least $\frac{3}{5}\Phi_i(p)$, at
most 2C steps after being targeted.
When executing an idle iteration, a processor plays the role of a thief attempting to steal
work from some victim. Thus, since P idle iterations occur from step i (inclusive) to step j
(exclusive), at least P steal attempts take place during that same interval. We can think of each
such steal attempt as a ball toss of Lemma 3.7 (corresponding to Lemma 5.9 of the appendix).
For each processor p in Di, we assign it a weight $W_p = \frac{3}{5}\Phi_i(p)$, and for each other processor p
in Ai, we assign it a weight $W_p = 0$. Clearly, the weights sum to $W = \frac{3}{5}\Phi_i(D_i)$. Using $\beta = \frac{1}{2}$ in
Lemma 3.7 (Lemma 5.9 of the appendix) it follows that with probability at least $1 - \frac{1}{(1-\beta)e} > \frac{1}{4}$,
the potential decreases by at least $\beta W = \frac{3}{10}\Phi_i(D_i)$, concluding the proof of this lemma.
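The two constants in Lemma 5.10 follow from plugging β = 1/2 into the balls-and-bins bound and combining it with the 3/5 drop per targeted processor. A minimal arithmetic check, with hypothetical variable names:

```python
import math
from fractions import Fraction

beta = Fraction(1, 2)             # choice of beta made in the proof
per_victim_drop = Fraction(3, 5)  # drop per targeted processor, from Lemma 5.8

# Fraction of Phi_i(D_i) that is guaranteed to be lost: beta * W / Phi_i(D_i).
guaranteed_fraction = beta * per_victim_drop

# Probability bound 1 - 1/((1 - beta) e) evaluated at beta = 1/2, i.e. 1 - 2/e.
success_probability = 1 - 1 / ((1 - float(beta)) * math.e)

print(guaranteed_fraction, round(success_probability, 4))
```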
Finally, we bound the expected number of idle iterations that take place during a
computation’s execution using the Low Cost Work Stealing algorithm. The result follows from Lemma 3.8
(corresponding to Lemma 5.10 of the appendix) and is proved using arguments similar to the
ones used in the proof of [7, Theorem 9]. The presented proof is an adaptation of
the one originally given for that theorem.
Lemma 5.11. Consider any computation with work $T_1$ and critical-path length $T_\infty$ being executed
by Low Cost Work Stealing using P processors. The expected number of idle iterations is at most
$O(PT_\infty)$. Moreover, with probability at least $1 - \varepsilon$ the number of idle iterations is at most
$O\left(\left(T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right)P\right)$.
Proof. To analyze the number of idle iterations, we break the execution into phases, each
composed of Θ (P ) idle iterations. Then, we prove that, with constant probability, a phase leads the
potential to drop by a constant factor.
A computation’s execution begins when the root gets assigned to a processor. By definition,
the weight of the root is $T_\infty$, implying the potential at the beginning of a computation’s execution
starts at $\Phi_0 = 4^{3T_\infty - 2}$. Furthermore, it is straightforward to deduce that the potential is 0 after
(and only after) a computation’s execution terminates. We use these facts to bound the expected
number of phases needed to decrease the potential down to 0. The first phase starts at step $t_1 = 1$,
and ends at the first step $t_1'$ such that at least P idle iterations took place during the interval
$[t_1, t_1' - 2C]$. The second phase starts at step $t_2 = t_1' + 1$, and so on.
Consider two consecutive phases starting at steps i and j respectively. We now prove that
$P\left\{\Phi_j \le \frac{7}{10}\Phi_i\right\} > \frac{1}{4}$. Recall that we can partition the potential as $\Phi_i = \Phi_i(A_i) + \Phi_i(D_i)$.
Since, from the beginning of each phase and until its last 2C steps, at least P idle iterations
take place, by Lemma 3.8 (corresponding to Lemma 5.10 of the appendix) it follows
$P\left\{\Phi_i - \Phi_j \ge \frac{3}{10}\Phi_i(D_i)\right\} > \frac{1}{4}$. Now, we have to prove the potential also drops by a constant
fraction of $\Phi_i(A_i)$. Consider some processor p ∈ Ai:
• If p does not have an assigned node, then $\Phi_i(p) = 0$.
• Otherwise, if p has an assigned node u at step i, then $\Phi_i(p) = \phi_i(u)$. Noting that each
phase has more than C steps, p executes u before the next phase begins (i.e. before
step j). Thus, the potential drops by at least $\frac{47}{64}\phi_i(u)$ during that phase.
Summing over all p ∈ Ai, the potential drops by at least $\frac{47}{64}\Phi_i(A_i)$ during the phase. Since
$\frac{47}{64} > \frac{3}{10}$, no matter how $\Phi_i$ is
partitioned between $\Phi_i(A_i)$ and $\Phi_i(D_i)$, we have $P\left\{\Phi_i - \Phi_j \ge \frac{3}{10}\Phi_i\right\} > \frac{1}{4}$.
We say a phase is successful if it leads the potential to decrease by at least a $\frac{3}{10}$ fraction.
So, a phase succeeds with probability at least $\frac{1}{4}$. Since the potential is an integer,
and, as aforementioned, starts at $\Phi_0 = 4^{3T_\infty - 2}$ and ends at 0, there can be at most
$(3T_\infty - 2)\log_{\frac{10}{7}}(4) < 12T_\infty$ successful phases. If we think of each phase as a coin toss, where
the probability that we get heads is at least $\frac{1}{4}$, then the expected number of coin tosses needed
to get heads $12T_\infty$ times is at most $48T_\infty$. In the same way, the expected number of phases
needed to obtain $12T_\infty$ successful ones is at most $48T_\infty$. Consequently, the expected number of
phases is $O(T_\infty)$. Moreover, as each phase contains O (P ) idle iterations, the expected number
of idle iterations is $O(PT_\infty)$.
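The constants 12 and 48 used above can be sanity-checked numerically. The sketch below assumes only what the proof states: the potential starts at $4^{3T_\infty - 2}$, a successful phase multiplies it by at most 7/10, and a phase succeeds with probability at least 1/4. The function names are hypothetical.

```python
import math

def successful_phases_bound(t_inf):
    """(3*T_inf - 2) * log_{10/7}(4): successes after which the potential is below 1."""
    return (3 * t_inf - 2) * math.log(4) / math.log(10 / 7)

def expected_phases(t_inf, success_probability=0.25):
    """Expected coin tosses to collect 12*T_inf heads at the given success probability."""
    return 12 * t_inf / success_probability

for t_inf in (1, 10, 100):
    print(t_inf, round(successful_phases_bound(t_inf), 2), expected_phases(t_inf))
```

Since log base 10/7 of 4 is a little under 3.9, the first quantity stays below 12 T∞ for every critical-path length.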
Now, suppose the execution takes $n = 48T_\infty + m$ phases. Each phase succeeds with probability
at least $p = \frac{1}{4}$, meaning the expected number of successes is at least $np = 12T_\infty + \frac{m}{4}$.
We now compute the probability that the number X of successes is less than $12T_\infty$. We use the
Chernoff bound [5], $P\{X < np - a\} < e^{-\frac{a^2}{2np}}$, with $a = \frac{m}{4}$. It follows $np - a = 12T_\infty$. Choosing
$m = 48T_\infty + 16\ln\left(\frac{1}{\varepsilon}\right)$, we have
$$P\{X < 12T_\infty\} < e^{-\frac{\left(\frac{m}{4}\right)^2}{2\left(12T_\infty + \frac{m}{4}\right)}} \le e^{-\frac{m}{16}} \le e^{-\frac{16\ln\left(\frac{1}{\varepsilon}\right)}{16}} = \varepsilon.$$
Thus, the probability that the execution takes $96T_\infty + 16\ln\left(\frac{1}{\varepsilon}\right)$ phases or more is less than $\varepsilon$. With this
we conclude that the number of idle iterations is at most $O\left(\left(T_\infty + \ln\left(\frac{1}{\varepsilon}\right)\right)P\right)$ with probability
at least $1 - \varepsilon$.
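The tail bound can be compared against an exact binomial computation. This is a sketch under a simplifying assumption that the proof does not need: phases are treated as independent trials that succeed with probability exactly 1/4 (the least favourable value allowed). The function names are hypothetical.

```python
import math

def binomial_lower_tail(n, p, k):
    """Exact P{X < k} for X ~ Binomial(n, p)."""
    return sum(math.comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(k))

def phases_for_confidence(t_inf, eps):
    """Number of phases 96*T_inf + 16*ln(1/eps) from the proof, rounded up."""
    return math.ceil(96 * t_inf + 16 * math.log(1 / eps))

t_inf, eps = 2, 0.01
n = phases_for_confidence(t_inf, eps)
tail = binomial_lower_tail(n, 0.25, 12 * t_inf)
print(f"phases = {n}, P(fewer than {12 * t_inf} successes) = {tail:.3e}")
```

Under this assumption the exact tail is far smaller than ε, which is consistent with the Chernoff bound being conservative.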
