October 26, 2018
Abstract
Algorithms for scheduling structured parallel computations have been widely studied in the literature. Work Stealing has long been one of the most popular algorithms for scheduling such computations, and its performance has been studied in both theory and practice. Although it delivers provably good performance, the effectiveness of its underlying load balancing strategy is known to be limited for certain classes of computations, particularly those exhibiting irregular parallelism (e.g. depth-first searches). Many studies have addressed this limitation from a purely load balancing perspective, viewing computations as sets of independent tasks and analyzing the expected amount of work attached to each processor as the execution progresses. However, these studies make strong assumptions regarding work generation. Such assumptions are standard from a queuing theory perspective, where work generation can be assumed to follow some random distribution, but they do not match the reality of structured parallel computations, where work generation is not random and depends only on the structure of the computation.
In this paper, we introduce a formal framework for studying the performance of structured computation schedulers, define a criterion that is appropriate for measuring their performance, and present a methodology for analyzing the performance of randomized schedulers. We demonstrate the convenience of this methodology by using it to prove that the performance of Work Stealing is limited, and to analyze the performance of a Work Stealing and Spreading algorithm, which overcomes Work Stealing's limitation.
Introduction
The main goal of a structured computation scheduler is to guarantee the fast completion of the execution of arbitrary structured computations. Work Stealing has long been one of the most popular algorithms for scheduling structured computations [ABP98, ABP01, BL99, ACR13, BFG03, CKK+08, DLS+09, HS02, QW10]. In Work Stealing (or WS for short), each processor owns a deque that it uses to keep track of its work. Busy processors operate locally on their deques, adding and retrieving work from them as necessary, until they run out of work. When that happens, a processor becomes a thief and starts a stealing phase, during which it targets other processors, uniformly at random, in order to steal work from their deques.
As proved in [ABP01, BL99], the expected execution time of any computation using WS is asymptotically optimal. Nevertheless, WS's performance is known to be limited for computations that exhibit irregular parallelism (e.g. depth-first computations where only a few threads actually generate work) [ACR13, BFG03, CKK+08, DLS+09]. To cope with this limitation, numerous studies have resorted to steal-half deques [ACR13, DLS+09, HS02, QW10], which allow thieves to take up to half of the work of their victims in a single steal operation. The adoption of steal-half strategies by real-life schedulers has been mostly justified by their importance in distributed memory environments, where each steal attempt incurs significant latency, making it worthwhile to transfer a larger amount of work in a single steal. On the other hand, the steal-half strategy has been formally proved, from a queuing theory perspective, to be an effective load balancing method for schedulers of independent tasks [BFG03]. From that perspective, tasks are assumed to arrive at a system according to some probability distribution, and work transfers are assumed to take constant time regardless of the number of tasks transferred. In structured computation scheduling, by contrast, work generation depends only on the structure of the computation, and the time a processor takes to transfer work from another processor is proportional to the amount of work transferred. It therefore remains unknown whether the steal-half strategy is suitable for structured computation scheduling, and so the problem of how to cope with WS's limitation remains open.
Even more importantly, there are well-established methods for analyzing the performance of the load balancers of independent task schedulers (usually based on the analysis of Markov chains), and a well-defined goal (typically, to ensure that the system's load does not grow unboundedly over time). To the best of our knowledge, however, there are no well-defined methods suitable for analyzing the performance of the load balancers of online structured computation schedulers, nor even well-defined goals.
To this end, the contributions of this paper are:
• A formal framework for studying the performance of structured computation schedulers (Section 2). One of the key features of this framework is that it can be used to model most, if not all, practical scheduling algorithms.
• The definition of algorithm short-term stability (Section 2.1), which is an appropriate criterion for measuring the performance of online structured computation schedulers.
• A methodology for effectively studying the performance of randomized computation schedulers (Section 3). We demonstrate its convenience by: 1. using it to prove that the performance of WS is indeed limited (Section 3.2); and 2. presenting a (purely theoretical) variant of the WS algorithm where processors attempt to spread work as it is generated, and then using our methodology to show that the algorithm overcomes the identified limitations of WS (Section 4). Despite being purely theoretical, the algorithm we present gives insight into how the limitations of WS can be addressed.
A criterion to measure the performance of computation schedulers
As in much previous work [ABB02, ALHH08, ABP01, BL99, MA16, TGT+10], we model a computation as a dag $G = (V, E)$, where each node $v \in V$ corresponds to an instruction, and each edge $(\mu_1, \mu_2) \in E$ denotes an ordering constraint (meaning $\mu_2$ can only be executed after $\mu_1$). Nodes with in-degree 0 are referred to as roots, while nodes with out-degree 0 are called sinks. We make the two standard assumptions about the structure of computations. Let $G$ denote a computation's dag: 1. there is only one root and one sink in $G$; and 2. the out-degree of any node within $G$ is at most two. We consider that processors operate in discrete time steps, each executing one instruction (which may or may not correspond to a computation node) per time step. The execution of a computation is carried out by a set of processors denoted by $\mathit{Procs}$, whose cardinality is denoted by $P$. We assume that $P \geq 2$ (i.e. $\mathit{Procs}$ is composed of at least two processors), and that all processors operate synchronously in time steps. Therefore, a computation's execution can be partitioned into discrete time steps, such that at each step every processor executes an instruction. We refer to these time steps using non-negative integers, where 0 is the first step and $i + 1$ is the step succeeding $i$.
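The two structural assumptions on a computation's dag can be checked mechanically. The following Python sketch is only an illustration: the function name, the edge-list representation, and the example dag are assumptions made here, and acyclicity is taken for granted rather than verified.

```python
from collections import defaultdict


def satisfies_assumptions(nodes, edges):
    """Check the two structural assumptions on a computation dag.

    nodes: iterable of hashable node ids.
    edges: iterable of (u, v) pairs, meaning v can only run after u.
    Acyclicity is assumed, not checked.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for u, v in edges:
        out_degree[u] += 1
        in_degree[v] += 1
    nodes = list(nodes)
    roots = [v for v in nodes if in_degree[v] == 0]
    sinks = [v for v in nodes if out_degree[v] == 0]
    max_out = max((out_degree[v] for v in nodes), default=0)
    # Assumption 1: one root and one sink. Assumption 2: out-degree at most two.
    return len(roots) == 1 and len(sinks) == 1 and max_out <= 2


# Hypothetical fork-join computation: a forks into b and c, which join at d.
diamond_nodes = ["a", "b", "c", "d"]
diamond_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

Adding a third outgoing edge to the root, or a second node without predecessors, makes the check fail.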
Definition 2.1. At any step during a computation's execution, each node of the computation is in exactly one of the following states: not ready, if some of its predecessors have not yet been executed; ready, if all its predecessors have been executed, but not the node itself; and executed, if the node has been executed.
As one may note, a node can only be ready if all the ordering constraints wrt (with respect to) the node are satisfied. For example, at the first step of a computation's execution every node (except for the root) is not ready. To ensure the correct execution of a computation, only nodes that are not ready can become ready, and only nodes that are ready can become executed. For each step $i$, refer to the set of nodes that are: 1. not ready by $\mathit{NonReady}_i$; 2. ready by $\mathit{Ready}_i$ (or simply $R_i$); and 3. executed by $\mathit{Executed}_i$. Since only the nodes that are ready can become executed, $\mathit{Executed}_i$ can alternatively be defined as $\mathit{Executed}_i = \left( \bigcup_{j \in \{1, \dots, i-1\}} R_j \right) - R_i$ (i.e. the set of all nodes that were once ready, but no longer are).
For each step $i$, partition $R_i$ into $P$ sets (one per processor), and refer to $R_i(p)$, processor $p$'s partition of $R_i$, as the set of nodes that are attached to processor $p$ at step $i$. Say that a node was enabled at step $i$ if it was not ready at step $i$ but is ready at step $i + 1$, and, similarly, that a node was executed at step $i$ if it was ready at step $i$ but is executed at step $i + 1$. In addition, say that a node $\mu$ is migrated if $\mu \in R_i(p)$ and $\mu \in R_{i+1}(q)$, where $p \neq q$, for $p, q \in \mathit{Procs}$. The next definition formalizes these ideas.
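The three node states can be illustrated with a short Python sketch. The representation (a mapping from each node to its set of predecessors) and all names are assumptions chosen for this example only.

```python
from enum import Enum


class State(Enum):
    NOT_READY = "not ready"
    READY = "ready"
    EXECUTED = "executed"


def node_states(preds, executed):
    """Classify every node, given the set of nodes executed so far.

    preds maps each node to the set of its predecessors.
    """
    states = {}
    for node, node_preds in preds.items():
        if node in executed:
            states[node] = State.EXECUTED
        elif node_preds <= executed:
            # Every ordering constraint on this node is satisfied.
            states[node] = State.READY
        else:
            states[node] = State.NOT_READY
    return states


# Hypothetical dag: a -> b, a -> c, b -> d, c -> d.
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
before = node_states(preds, executed=set())
after_root = node_states(preds, executed={"a"})
```

In this example only the root is ready before anything runs, and executing the root makes both of its successors ready while the sink stays not ready.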
Definition 2.2. For each step i and processor p, define the set of nodes enabled by p as
Moreover, define the set of nodes migrated from p to all other processors as M
For a set of processors S ∈ P(P rocs),
Having defined these sets of nodes, we now introduce rounds. Informally, a round is a sequence of time steps with constant length, such that every processor executes at most one node and no ready node is migrated more than once.
Definition 2.3. A round is a sequence of $L$ time steps (for some constant $L \geq 1$) such that a computation's execution can be partitioned into equal-length rounds and for every round: 1. no processor executes more than a single node; and 2. no node is migrated more than once.
Analogously to time steps, refer to rounds using non-negative integers, but with an additional bar, where $\bar{0}$ denotes the first round. Throughout this paper, we let $L$ denote the length of rounds and $t[i]$ denote the $i$-th step of a round $t$ (for $i \in \{0, \dots, L - 1\}$). As we will see, the length of rounds depends on the scheduling algorithm. Now, we reintroduce the concepts presented above concerning the states of nodes, but this time considering rounds rather than steps.
Definition 2.4. For each round $t$ and processor $p$, define the set of nodes attached to $p$ at round $t$ as $R_t(p) = R_{t[0]}(p)$, enabled by $p$ during $t$ as
executed by p during t as
migrated from p to all other processors during t as
and migrated from all other processors to p during t as
The following result, Lemma 2.5, states that the set of nodes migrated to a processor p during a round t is a subset of all the nodes that are migrated from all processors but p during t. The proof of the following result can be found in the Appendix (Section A.1).
Lemma 2.5. For any round t and p ∈ P rocs, M
As one may note, the definition of round (Definition 2.3) implies that for any processor p and round t, |C t (p)| ≤ 1 and M
(i.e. no two processors migrate the same node during the same round). By Lemma 2.5, it then follows
The next lemma is essential for the rest of our analysis, as it shows the connection between the set of nodes that are attached to each processor p at some round t, and the set of nodes that are attached to p at round t + 1. The proof of this result can be found in the Appendix (Section A.2).
Lemma 2.6 (Round Progression Lemma). For any round t and processor p ∈ P rocs, 
Algorithm short-term stability
We now present algorithm short-term stability, the criterion that will be used to measure the performance of structured computation schedulers. We begin by stating the following requirement, which guarantees that a processor $p$ only executes a node $\mu$ during a round $t$ if $\mu$ is attached to $p$ at the beginning of that round.
Requirement 2.7. For any round t and p ∈ P rocs, we must have
Definition 2.8 (Busy and Idle Processors). Say that a processor $p \in \mathit{Procs}$ is idle during a round $t$ if $C_t(p) = \emptyset$, and, otherwise, say that $p$ is busy. Moreover, denote the number of idle processors during a round $t$ by $P^{idle}_t$, and define $\alpha_t$ as the ratio of idle processors, $\alpha_t = P^{idle}_t / P$.
Now, we introduce the notion of short-term stability. Intuitively, a set of processors $S$ is short-term stable for some round $t$ if the number of nodes attached to the processors in $S$ that are not executed is expected to decrease monotonically from round $t$ to round $t + 1$.
Definition 2.9 (Short-term stability). A set of processors S ∈ P(P rocs) is short-term stable for some round t during a computation's execution if
Ideally, we would want to ensure short-term stability for all rounds and wrt all processors (i.e. $S = \mathit{Procs}$). However, since a processor can enable two nodes during one round, a scheduler may only be able to guarantee short-term stability wrt all processors if at least half of them are idle during a round. For this reason, we now introduce algorithm short-term stability, which is based on the same rationale as short-term stability.
For each round $t$, we classify processors according to whether they execute all their attached nodes during $t$ or not. If a processor $p$ executes all its attached nodes during round $t$ (i.e. if $R_t(p) = C_t(p)$), then we say that $p$ is self-stable at round $t$. Otherwise, we say that $p$ is non-self-stable at round $t$.
Definition 2.10. Define the set of self-stable and non-self-stable processors at some round $t$ as $S_t = \{p \in \mathit{Procs} \mid R_t(p) = C_t(p)\}$ and $U_t = \mathit{Procs} - S_t$, respectively.
Having this, we can finally define algorithm short-term stability, our criterion for measuring the performance of computation schedulers. Informally, the main idea is that if the ratio of idle processors at some round $t$ is sufficiently high, then the amount of work attached to non-self-stable processors is expected to decrease, and, at the same time, the amount of work attached to self-stable processors does not grow unboundedly.
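Definitions 2.8 and 2.10 can be read as a simple per-round classification. The Python sketch below assumes a hypothetical snapshot of a round given as two dictionaries, one with the attached sets $R_t(p)$ and one with the executed sets $C_t(p)$; the function and variable names are illustrative.

```python
def classify_round(attached, executed):
    """Return (alpha_t, S_t, U_t) for one round.

    attached[p] is the set R_t(p); executed[p] is the set C_t(p).
    """
    procs = set(attached)
    idle = {p for p in procs if not executed[p]}              # C_t(p) is empty
    self_stable = {p for p in procs if attached[p] == executed[p]}
    non_self_stable = procs - self_stable
    alpha = len(idle) / len(procs)
    return alpha, self_stable, non_self_stable


# Hypothetical round with four processors; only p0 keeps unexecuted work.
attached = {"p0": {"a", "b"}, "p1": {"c"}, "p2": set(), "p3": set()}
executed = {"p0": {"a"}, "p1": {"c"}, "p2": set(), "p3": set()}
alpha, s_t, u_t = classify_round(attached, executed)
```

Note that an idle processor with no attached nodes satisfies $R_t(p) = C_t(p) = \emptyset$, so it counts as self-stable under Definition 2.10.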
Definition 2.11 (Algorithm short-term stability). A scheduling algorithm is algorithm short-term stable with respect to an interval I ⊆ ]0; 1[, iff (if and only if) for any round t,
where L denotes the length of the rounds.
Note that, in contrast to short-term stability, algorithm short-term stability requires that the expected number of nodes attached to processors of $U_t$ that are not executed strictly decreases from one round to the next. In addition, by limiting the number of nodes that can become attached to a self-stable processor during a round, we prevent scheduling algorithms from ping-ponging work between non-self-stable and self-stable processors throughout the execution. The insight behind bounding the number of nodes by the length of rounds is that we force processors to accept each node they are given.
Intuitively, if an algorithm is algorithm short-term stable wrt some non-empty interval $I$, then the algorithm's load balancer is sufficiently effective to guarantee that, regardless of the computation it is scheduling, for any round $t$ such that $\alpha_t \in I$, its performance is expected to be good (or, in other words, work accumulation is not expected). On the other hand, if an algorithm cannot guarantee algorithm short-term stability wrt any non-empty interval $I$ for any round during the execution of an arbitrary computation, then the effectiveness of its load balancer is limited, and thus may lead to work accumulation depending on the structure of the computation being executed.
As one may note, the definition of algorithm short-term stability relies on the overall behavior of the set of processors $U_t$, for each round $t$. However, it is much simpler to reason about the behavior of each processor $p \in U_t$ than it is to reason about the behavior of all processors of $U_t$. The next result is then crucial for the rest of our analysis, as it shows the relation between the behavior of all the processors of a set of processors $S$ and the behavior of each individual processor $p$ of $S$.
Lemma 2.12. For any round t and S ∈ P (P rocs), if
Proof. First, recall that R t is partitioned through the processors. By Requirement 2.7, C t is also partitioned through the processors and ∀p ∈ P rocs,
To conclude the proof, note that by the linearity of expectation,
The following result is the basis for the analysis of schedulers. For each round $t$, it relates the difference between the number of nodes that a processor $p$ enables during $t$, that are migrated from $p$ during $t$, and that $p$ executes during $t + 1$, to a quantity that is closely related to algorithm short-term stability (the corresponding proof can be found in the Appendix, Section A.3).
Lemma 2.13. For any round t and processor p ∈ P rocs,
The following Corollary then follows from Lemma 2.13.
Corollary 2.14. For any round t,
A method to analyze randomized schedulers
In order to analyze the performance of randomized scheduling algorithms, we introduce a few additional definitions and make some assumptions that are necessary to order the actions that processors take during the execution of computations, and, in particular, during each round. The reason for ordering the actions of processors will become apparent as we use it to analyze the WS algorithm. To aid the reader, as we present the extra definitions and assumptions that our methodology requires, we use a WS algorithm (depicted in Algorithm 1) to instantiate them and explain their meaning. The WS algorithm we analyze is a synchronous but behaviorally equivalent variant of the original non-blocking algorithm given in [ABP01]. Thus, each processor owns a lock-free deque object that supports three methods: pushBottom, popBottom and popTop. Only the owner of a deque may invoke the pushBottom and popBottom methods, which, respectively, add a node to the bottom of the deque, and remove and return the bottommost node of the deque, if any. The popTop method is invoked by processors searching for work, and for each invocation of this method, the deque's current topmost node is guaranteed to be removed and returned, either by that invocation or by some concurrent one¹. In addition to the deque, each processor has a variable assigned that stores the node that it will execute next, if any.
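To make the interface of the deque concrete, here is a minimal single-threaded Python sketch of its three methods. It is not the lock-free object of [ABP01]: concurrency is ignored entirely, and the class name, the snake_case method names, and the use of None for an empty result are assumptions made for this illustration.

```python
from collections import deque


class WorkDeque:
    """Sequential stand-in for a work-stealing deque (no concurrency control)."""

    def __init__(self):
        self._nodes = deque()  # left end is the top, right end is the bottom

    def push_bottom(self, node):
        """Owner only: add a node to the bottom."""
        self._nodes.append(node)

    def pop_bottom(self):
        """Owner only: remove and return the bottommost node, or None if empty."""
        return self._nodes.pop() if self._nodes else None

    def pop_top(self):
        """Thieves: remove and return the topmost node, or None if empty."""
        return self._nodes.popleft() if self._nodes else None


d = WorkDeque()
for node in ("n1", "n2", "n3"):
    d.push_bottom(node)
stolen = d.pop_top()        # a thief takes the oldest node
local = d.pop_bottom()      # the owner takes the newest node
```

The owner and the thieves operate on opposite ends, so the owner works on recently pushed nodes while thieves take the oldest ones.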
The methodology
First of all, we require that the scheduling algorithm to be analyzed must be defined by a cycle such that:
1. at most one of the instructions composing any particular iteration of this cycle may correspond to a node's execution;
2. no node that is migrated to a processor $p$ that is executing an iteration of this cycle can be migrated again (to another processor) before $p$ finishes the current iteration;
3. the length of any sequence of instructions that corresponds to some execution of this cycle is at most constant; and
4. the full sequence of instructions executed by any processor can be partitioned into smaller sub-sequences, each corresponding to a particular execution of this cycle.
Refer to this cycle as the scheduling loop, and to any sequence of instructions that corresponds to some iteration of a scheduling loop as a scheduling iteration.
As can be observed in Algorithm 1, the definition of the WS algorithm naturally fits into scheduling loops (corresponding to lines 2 to 19): 1. at most one of the instructions within the sequence of a scheduling iteration corresponds to the execution of a node (line 4); 2. no node that is migrated to a processor is migrated ever again, as it becomes the processor's new assigned node (line 23); 3. the length of any iteration of the scheduling loop is bounded by a constant; and 4. the full sequence of instructions executed by any processor can be partitioned into scheduling iterations.
To order the actions that processors take during scheduling iterations, each iteration can be partitioned into a sequence of phases. In particular, for WS, each iteration is partitioned into two phases:
Phase I If a processor has a valid assigned node, it executes the node. Otherwise, it makes a steal attempt, and if the attempt succeeds the stolen node becomes the processor's new assigned node.
[...] takes during the execution of every iteration. However, this ordering by itself does not meet our needs, as we have to guarantee that all processors start the execution of each phase of every scheduling iteration at the same time. Our first step towards that goal is to require all processors to begin working at the exact same time. Refer to the step at which a processor $p$ executes its $i$-th instruction as $\chi(p, i)$.
Requirement 3.1. $\forall p \in \mathit{Procs}$, $\chi(p, 1) = 0$.
Now, we present the synch procedure, which synchronizes processors at the end of each phase. The synch procedure takes two input parameters: 1. maxPhaseLength, the length of a longest sequence of instructions that may compose a given phase; and 2. currentPhaseLength, the number of time steps during which the processor has been executing the current phase, until the procedure's invocation. Given these parameters, synch adds a sequence of maxPhaseLength − currentPhaseLength no-op instructions, guaranteeing that the number of steps taken from the beginning of each phase's execution until the end of its call is the same for all processors. To use synch, we rely on the purely theoretical procedure ι to obtain the value of currentPhaseLength.
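The padding performed by synch can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: a phase is modeled as a list of instruction labels, no-ops are modeled as the string "no-op", and the length of the list stands in for the value that the theoretical procedure ι would return. The function names and instruction labels are hypothetical.

```python
def synch(max_phase_length, current_phase_length):
    """Return the no-op padding that brings a phase up to max_phase_length steps."""
    if not 0 <= current_phase_length <= max_phase_length:
        raise ValueError("current phase length must lie between 0 and the maximum")
    return ["no-op"] * (max_phase_length - current_phase_length)


def run_phase(instructions, max_phase_length):
    """Execute a phase and pad it so that every processor takes the same time."""
    trace = list(instructions)
    # len(trace) plays the role of the value returned by the procedure iota.
    trace += synch(max_phase_length, len(trace))
    return trace


# Two hypothetical processors taking different paths through the same phase.
busy_trace = run_phase(["execute node"], max_phase_length=4)
thief_trace = run_phase(["pick victim", "popTop", "assign node"], max_phase_length=4)
```

Both traces end up with the same number of steps, so the two processors start the next phase at the same step.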
Lastly, we partition a computation's execution even further, by partitioning all rounds equally into sequences of stages. To formalize this idea, define a stage partition $s \in \mathbb{N} \times \mathbb{N}$ as $s = (\mathit{base}, \mathit{offset})$, with $\mathit{offset} > 0$, where base and offset are, respectively, the starting step and length of the stage defined by $s$ within each round. Refer to the $i$-th stage of a round $t$ as $t_i$.
Definition 3.2. Let $L$ be the length of the rounds. Say that a set $S$ is a set of stage partitions if $L = \sum_{s \in S} \pi_2(s)$ and ∀s ∈ S [∃r ∈ S :
, where $\pi_i(t)$ denotes the projection of the $i$-th element of tuple $t$.
Remark 3.3.
To analyze a scheduler's performance using our methodology it suffices to: 1. define the scheduler by a scheduling loop; 2. divide the actions that processors take during each iteration of the loop (by partitioning each scheduling iteration into phases); and 3. insert a call to the synch procedure at the end of each phase.
Justification. By using the synch and ι procedures, one can guarantee that any scheduling algorithm that can be defined by a scheduling loop can be modified so that processors are kept synchronized throughout any computation's execution, with all processors beginning the execution of the $i$-th phase of the $n$-th scheduling iteration at the exact same step. With this, the length of each round can be set to $\sum_{i \in \mathit{Phases}} \mathit{length}_i$, where $\mathit{Phases}$ denotes the set of phases that compose a scheduling iteration and $\mathit{length}_i$ denotes the length of the $i$-th phase². Note that, since all processors execute each scheduling iteration in synchrony, the definition of scheduling loop ensures that the requirements of the definition of round are satisfied: 1. each round has constant length; 2. a computation's execution can be partitioned into a sequence of equal-length rounds; 3. during each round no processor executes more than a single node; and 4. no node is migrated more than once during a round. Then, it only remains to partition each round into a sequence of stages, having one stage per phase, and ensuring that the execution of the $i$-th phase of a scheduling iteration coincides with the $i$-th stage of the corresponding round.
In the synchronous WS scheduler, depicted in Algorithm 1, max_phase_I_length and max_phase_II_length are two constants that correspond to the lengths of the longest sequences of instructions composing the first and second phases of WS, respectively. Thus, by Remark 3.3, we can set the length of WS's rounds to max_phase_I_length + max_phase_II_length, and partition each such round into two stages whose lengths match the maximum lengths of the corresponding phases. To proceed to the analysis of WS's performance, it only remains to show that WS satisfies Requirement 2.7. For WS, say that a node $\mu$ is attached to a processor $p$ if one of the following conditions holds: 1. $\mu$ is $p$'s currently assigned node; 2. $\mu$ is stored in $p$'s deque; or 3. $\mu$ is stored in enabled[0] or enabled[1] (see line 4 of Algorithm 1). At the beginning of any round, each node that is attached to a processor is either in its deque or is the processor's currently assigned node. As can be observed in Algorithm 1, each processor only executes the node that is stored in its assigned variable. Since the value of this variable is not changed at least until the processor executes the node, the node was already stored in the assigned variable when the round began, and so the requirement is satisfied.
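The construction in the justification (one stage per phase, round length equal to the sum of the phase lengths) can be sketched as follows. The function name and the phase lengths used in the example are hypothetical; each stage is returned as a (base, offset) pair, matching the description of stage partitions given earlier.

```python
def stage_partitions(max_phase_lengths):
    """Build one (base, offset) stage per phase and return them with the round length.

    max_phase_lengths: maximum number of steps of each phase, in execution order.
    """
    stages = []
    base = 0
    for length in max_phase_lengths:
        if length <= 0:
            raise ValueError("every stage must have a positive offset")
        stages.append((base, length))
        base += length
    return stages, base  # base now equals the round length L


# Hypothetical two-phase scheduler whose phases take at most 4 and 3 steps.
stages, round_length = stage_partitions([4, 3])
```

Here the first stage covers steps 0 to 3 of every round, the second covers steps 4 to 6, and the round length equals the sum of the stage offsets.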
Work Stealing's performance
To show that the synchronous WS algorithm (as defined in Algorithm 1) is not Algorithm short-term stable, we will create a computation for which work tends to accumulate unboundedly in some busy processors' deques. Before moving to the actual proof, however, we have to make an additional definition.
Definition 3.4. Refer to the set of nodes stolen at step $i$ from a processor $p$ as $\mathit{Stolen}^+_i(p)$, and to the set of nodes stolen by $p$ as $\mathit{Stolen}^-_i(p)$. Moreover, for some round $t$, define the set of nodes stolen during $t$ from $p$ as
and the set of nodes stolen by p as
Proof. Both results follow from Definition 3.4 and the specification of Algorithm 1.
We now move to obtain both lower and upper bounds on the expected number of nodes that are stolen from a non-self-stable processor during a round.
Lemma 3.6. Suppose there are $B$ bins and $B\alpha$ balls, and that each ball is tossed independently and uniformly at random into the bins. For a bin $b_i$, let $Y_i$ be an indicator variable, defined as
Proof. The probability that no ball lands in
Lemma 3.7. For any round t and p ∈ U t during a computation's execution using WS, we have 1 − e −α t ≤ E[|Stolen
Proof. By observing Algorithm 1, it follows that a processor makes a steal attempt iff it is idle, implying that exactly $P\alpha_t$ steal attempts are made during round $t$. Note that: 1. steal attempts are independent from one another; and 2. a steal attempt corresponds to targeting a processor uniformly at random and then invoking the popTop method on its deque. If we imagine that each steal attempt is a ball toss and that each processor's deque is a bin, it follows by Lemma 3.6 that the probability of $p$'s deque being targeted is at least $1 - e^{-\alpha_t}$. On the other hand, the expected number of invocations of the popTop method of any processor $p$'s deque is $(P\alpha_t)/P = \alpha_t$. Since $p$ may only invoke the popBottom method of its deque during the second phase and all the steal attempts take place during the first phase, then, taking into account the deque semantics (see Section B): 1. if $p$'s deque is targeted by at least one steal attempt, then at least one node is stolen; and 2. at most one node might be returned for each invocation of the popTop method. Thus, E[|Stolen
Lemma 3.8. For any round t, and p ∈ U t we have
where α t is the ratio of idle processors.
Proof. Lemmas 3.5 and 3.7 imply this result.
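The balls-and-bins argument used in the proof of Lemma 3.7 can be checked numerically. The Python sketch below is only a sanity check under the model described in that proof (each of the $P\alpha_t$ steal attempts independently targets one of $P$ deques uniformly at random); the function names are assumptions. It computes the exact probability that a fixed deque is targeted at least once, and compares it with the lower bound $1 - e^{-\alpha_t}$ and with the expected number of attempts $\alpha_t$.

```python
import math


def prob_targeted(procs, idle):
    """Exact probability that a fixed deque is hit by at least one of `idle` attempts."""
    return 1.0 - (1.0 - 1.0 / procs) ** idle


def lower_bound(procs, idle):
    """The bound 1 - e^(-alpha), with alpha = idle / procs."""
    return 1.0 - math.exp(-idle / procs)


def expected_attempts(procs, idle):
    """Expected number of steal attempts that target a fixed deque."""
    return idle / procs


def bounds_hold(procs, idle, eps=1e-12):
    p = prob_targeted(procs, idle)
    return lower_bound(procs, idle) <= p + eps and p <= expected_attempts(procs, idle) + eps
```

For instance, `bounds_hold(8, 4)` compares the three quantities for eight processors of which four are idle, i.e. for a ratio of idle processors of one half.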
The next lemma follows from the behavior of the WS algorithm.
Lemma 3.9. Consider some processor p ∈ P rocs and some round t during the execution of a computation by WS. If p ∈ U t then p's deque is non-empty and
Proof. By the definition of Algorithm 1 it can be proved by induction on the progression of a computation's execution that if a processor has at least one attached node at the beginning of round t, then the processor executes a node during t. From that, and by observing the algorithm, it follows that if p has at least one node attached, then it does not make any steal attempt during t, implying Stolen
On the other hand, since p always executes one of its attached nodes if there is any, it follows that if p ∈ U t then p's deque is not empty.
If $p$ only has a single attached node, then $p \in S_t$. Because it has one attached node, it follows that $\mathit{Stolen}^-_t(p) = \emptyset$. Again, Lemma 3.5 then implies $M^-_t(p) = \emptyset$. In addition, since the out-degree of any node is at most two (by our conventions regarding computations' structure), at the end of the round $p$ has at most two attached nodes.
Finally, we show that if $R_t(p) = \emptyset$, then at the end of the round $p$ has at most one attached node. If $p$ has no attached node, then its assigned variable does not contain a valid node, implying that $p$ executes a call to the WorkMigration procedure. Since each call only entails one invocation of the popTop method, then, taking into account the method's semantics³, it follows that $p$ may get at most one node from its steal attempt. Since, after performing such an attempt, $p$ takes no further action during the scheduling iteration other than simply waiting for it to end, we conclude that the lemma holds.
From Lemma 3.9 and the definition of round (Definition 2.3), it follows that $\forall p \in S_t$, $|R_{t+1}(p)| \leq 2 \leq L + 1$, where $L$ is the length of the rounds, which is at least 1. Thus, if we were to show that WS is algorithm short-term stable wrt some interval $I \subseteq {]0; 1[}$, then, by Corollary 2.14, we would only have to prove that for any round $t$ such that $\alpha_t \in I$, we had
Unfortunately, as we now prove, there is no non-empty interval $I$ wrt which WS is algorithm short-term stable.
Proof. Due to our conventions regarding the structure of computations, it follows that during a round a processor can enable two nodes. For some round $t$, let $p$ be a non-self-stable processor (i.e. $p \in U_t$) such that $|E_t(p)| = 2$. Lemmas 2.6 and 3.9 imply
As already noted in the proof of Lemma 3.9, for the WS algorithm, since p ∈ U t , it follows that |C t (p)| = 1. Since we have 1.
Since p enabled two nodes, it executes a node during the next round, implying
Even though the definition of algorithm short-term stability only considers ratios of idle processors in ]0; 1[, note that for WS, if all processors are idle, then the computation's execution must have already finished, and so it only makes sense to analyze rounds during which the execution is still ongoing. It then follows that
As one might note, this implies that WS cannot even guarantee short-term stability for the set $\{p\}$ (recall Definition 2.9), regardless of the ratio of idle processors.
A greedy Work Stealing and Spreading algorithm
In this section we present and analyze the performance of a (purely theoretical) greedy Work Stealing and Spreading scheduler (or simply WSS). This algorithm (depicted in Algorithm 2) is a variant of WS where processors load balance not only by stealing work, but also by spreading it. As in WS, each processor owns a lock-free deque (obeying the semantics defined in [ABP01]) and a variable assigned that stores the node that it will execute next, if any. To implement the spreading mechanism, each processor additionally owns a state flag and a donation cell. Processors use the state flag to inform other processors of their current state (working, idle, or marked as target of a donation; more on this ahead) and use the donation cell to store nodes that they want to spread. In WSS, processors are uniquely identified by an id, with which they can be accessed in constant time. The scheduler also makes use of the CAS instruction (Compare-And-Swap), with its usual semantics. Thus, at most one CAS instruction targeting the same memory location can successfully execute at each step. We assume that the processor that succeeds in executing the CAS instruction over a memory address $m$ at some step $i$ is chosen uniformly at random from the set of processors that are eligible to successfully execute the instruction at step $i$ over memory address $m$.
Unlike WS, each scheduling iteration of WSS is partitioned into three phases. Phase I of WSS is very similar to its WS counterpart, differing only in that WSS processors keep updating their state flags to reflect their current state. Phases II and III of WSS are as follows:
Phase II If, in phase I, a processor p made a steal attempt or executed a node that did not enable any node, then p does not take any action during this phase. Otherwise, if at least one node was enabled, one of the enabled nodes becomes p's new assigned node. If two nodes were enabled, then, after having a new node assigned, p attempts to spread the node it did not assign.
Phase III If a processor p executed a node in phase I but no node was enabled, p invokes popBottom to fetch the bottommost node from its deque, if there is any. On the other hand, if a single node was enabled, p does not take any action during this phase. If two nodes were enabled, p only takes action if the donation attempt it made during phase II failed. In that case, p pushes the node it failed to donate onto the bottom of its own deque, via the pushBottom method. Finally, if the processor made an unsuccessful steal attempt during the first phase, it polls its state flag to check for incoming donations. If there is a donation, p transfers the node from the donor's donation cell and updates its state flag accordingly.
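The spreading mechanism of Phases II and III can be illustrated with a minimal, single-threaded sketch. It is not Algorithm 2: the names Processor, compare_and_swap, spread_attempt and accept_donation, the encoding of the "marked" state as a tuple carrying the donor's id, and the choice of writing the donation cell before the CAS are all assumptions made for illustration, and the CAS is emulated sequentially rather than performed atomically.

```python
import random
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

WORKING, IDLE = "working", "idle"  # hypothetical encodings of the state flag


@dataclass
class Processor:
    pid: int
    state: object = IDLE                  # WORKING, IDLE, or ("marked", donor_pid)
    donation_cell: Optional[str] = None   # node this processor is trying to spread
    assigned: Optional[str] = None        # node to execute next, if any
    dq: deque = field(default_factory=deque)


def compare_and_swap(proc, expected, new):
    """Sequential stand-in for a CAS on proc.state (assumption: no real atomicity)."""
    if proc.state == expected:
        proc.state = new
        return True
    return False


def spread_attempt(donor, procs, node, rng):
    """Phase II donation attempt, with the Phase III fallback on failure."""
    target = procs[rng.randrange(len(procs))]   # donee chosen uniformly at random
    donor.donation_cell = node
    if compare_and_swap(target, IDLE, ("marked", donor.pid)):
        return True                             # target is now marked for donation
    donor.donation_cell = None
    donor.dq.append(node)                       # pushBottom onto the donor's own deque
    return False


def accept_donation(p, procs):
    """Phase III for a processor whose steal failed: poll the state flag."""
    if isinstance(p.state, tuple) and p.state[0] == "marked":
        donor = procs[p.state[1]]
        p.assigned = donor.donation_cell        # transfer the node from the donor's cell
        donor.donation_cell = None
        p.state = WORKING
        return True
    return False
```

In this sketch a donation to a working processor always fails, because the emulated CAS only succeeds on a flag that is currently idle; the failed node then stays with the donor, as described for Phase III.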
Definition 4.1. Refer to the set of nodes spread at step i by a processor p as Spread + i (p), and to the set of nodes spread to p as Spread − i (p). Moreover, for some round t, define the set of nodes spread during t by p as
and the set of nodes spread to p as
The next claim implies that we can use the methodology described in Section 3 to analyze the performance of WSS.
Claim 4.2. The WSS algorithm can be defined using a scheduling loop and meets Requirement 2.7.
Proof Sketch. As one can observe from Algorithm 2, like WS, WSS can also be naturally defined using scheduling loops (lines 2 to 24): 1. at most one node is executed per scheduling iteration; 2. if a node is migrated to a processor during an iteration, then it is not migrated again; 3. the length of every iteration is bounded by a constant; and 4. the full sequence of instructions executed by any processor can be partitioned into a sequence of scheduling
Proof. Both results follow from Definition 3.4 and Algorithm 2.
The next lemma is analogous to Lemma 3.9, but concerning the WSS algorithm.
Lemma 4.4. Consider some p ∈ Procs and some round t during the execution of a computation by WSS. Then: 1. if p ∈ U t then p's deque is non-empty and M − t (p) = ∅; and 2. if p ∈ S t then R t+1 (p) ≤ 2.
Proof. This proof follows the same general arguments as the proof of Lemma 3.9.
As for WS, taking into account the definition of the WSS scheduler (depicted in Algorithm 2), it can be proved by induction on the progression of the computation's execution that if a processor has at least one attached node at the beginning of round t, then the processor executes a node during t. From that, and by observing the algorithm, it follows that if p has at least one node attached, then: 1. p does not make any steal attempt during t, implying Stolen − t (p) = ∅; and 2. p's state flag is set to working at least until the beginning of the third phase of round t, implying that no processor donates work to p, and so Spread − t (p) = ∅.
To conclude the proof of the first statement of this lemma, note that because p always executes one of its attached nodes as long as there is any, it follows that if p ∈ U t then p has at least two nodes attached, and so p's deque cannot be empty.
Again, since p executes one of its attached nodes as long as there is any, if p only has a single attached node then p ∈ S t and M nodes' out-degree assumption, it then follows that at the end of round t, p can have at most two attached nodes.
Finally, we show that if R t (p) = ∅, then at the end of the round p has at most two attached nodes. If p has no attached node, then its assigned variable does not contain a valid node, implying that p executes a call to the LoadBalance procedure (line 22). Since each call entails only one invocation of the popTop method, then, taking into account the method's semantics (see Section B), it follows that p may get at most one node from its steal attempt. On the other hand, as one can deduce from the definition of the LoadBalance procedure, p can accept at most one node donation during the procedure's invocation. Thus, at the end of the call p can have at most two attached nodes. To conclude the proof of the second part of the lemma, note that, after the call to the LoadBalance procedure returns, p takes no further action during the iteration.
With this, we start obtaining bounds on the expected number of nodes that are migrated during a round for WSS. To begin, we obtain both lower and upper bounds on the expected number of nodes stolen for WSS.
Lemma 4.5. For any round t and p ∈ U t during a computation's execution using WSS, 1 − e −α t ≤ E[|Stolen
Proof. The proof of this result is identical to the proof of Lemma 3.7, following from Lemma 4.4 and the definition of WSS (see Algorithm 2).
Next, we obtain lower bounds on the expected number of nodes spread by any processor that enables two nodes. A full proof of this result can be found in the Appendix (Section C.1).
The following lemma, together with Lemmas 4.5 and 4.6, allows us to obtain bounds on the expected number of nodes that are migrated from a processor. A full proof can be found in the Appendix (Section C.2).
Lemma 4.7. For WSS, at any round t, Stolen
The next result states an inequality that will be used to prove that WSS is algorithm short-term stable with respect to the interval [0.7375; 1[, and its full proof can be found in the Appendix (Section C.3).
Proof.
Recall that L denotes the length of each round, and that by definition L ≥ 1. Then, from Lemma 4.4, it follows that ∀p ∈ S t , R t+1 (p) ≤ 2 ≤ L+1. Furthermore, taking into account Lemma 2.12 and Corollary 2.14, it follows that to prove this theorem it suffices to show that for any round t such that
For an arbitrary round t, consider any processor p ∈ U t (i.e. p is a processor that is non-self-stable at round t). Due to our conventions regarding the structure of computations, |E t (p)| is either 0, 1, or 2:
• If |E t (p)| = 1 then, by the specification of the WSS algorithm, it follows C t+1 (p) = 1. Taking into account Lemma 4.5, we deduce
• By the specification of the WSS algorithm, it follows that if |E t (p)| = 2 then C t+1 (p) = 1. Thus, to prove this case it suffices to show 1 − e −(1−α t ) . Thus, taking into account Lemma 4.8, having α = α t , we conclude the proof of Theorem 4.9.
Related work
To the best of our knowledge, there is no work that analyzes the performance of online structured computation schedulers, on a round basis, depending solely on the ratio of idle processors.
Most theoretical work on online structured computation schedulers has focused on proving properties related to the (complete) execution of computations by WS and its variants. Blumofe et al. proved that WS is optimal up to a constant factor in terms of space requirements, expected execution time, and expected communication costs [BL99]. Arora et al. showed that WS is optimal even for multiprogrammed environments [ABP98, ABP01]. Agrawal et al. introduced a variant of WS that avoids unnecessary load balancing cycles in order to achieve higher efficiency [AHHL07, ALHH08]. The authors proved that WS is capable of maintaining nearly optimal bounds while reducing the number of cycles during which processors are not making progress on a computation's execution (corresponding to load balancing cycles) down to a constant factor away from the computation's total amount of work. Regarding data locality, Acar et al. obtained both lower and upper bounds on the number of cache misses using WS [ABB02]. More recent research has focused on reducing the synchronization overheads of WS [ACR13].
While, for the execution of structured computations, work generation depends on what has already been executed, for independent task scheduling, work generation (or, more correctly, task arrival) is assumed to be independent of the tasks that processors have already executed [ACMR98, ABKU99, BFG03, LM93, Mit98, MPS02]. In fact, much of the work in this area consists of studying the effectiveness of different strategies (that rely on randomness) for placing n balls (each representing a task) into n bins (each representing a processor) [ACMR98, ABKU99, LM93, MPS02], where a strategy's effectiveness is measured by the number of balls that the fullest bin is expected to have: the lower this number, the more effective the strategy.
Of course, models of this type, despite being suitable for modelling independent task schedulers, are far from apt to model the performance of structured computation schedulers (for example, note that in the execution of a structured computation, work is generated per processor). Within the area of independent task scheduling, perhaps the work most closely related to ours is on the performance analysis of online independent task schedulers [BFG03, Mit98]. Yet, to the best of our knowledge, all the analyses of these schedulers rely upon the assumption that tasks arrive at the system according to some random distribution (typically a Poisson distribution). For instance, Mitzenmacher proposed a simple but powerful scheme, based on differential equations, to analyze independent task work stealing schedulers [Mit98]. This scheme makes it possible to study not only the most basic work stealing schedulers (of independent tasks), but also more complex variants (e.g. allowing processors to repeat a steal whenever their steal attempt aborts). Nevertheless, the proposed scheme relies on the assumption that work is generated according to some random distribution, and so it is not suitable for modelling the behavior of structured computation schedulers. Berenbrink et al. study the performance of independent task work stealing schedulers, modelling the system as a Markov chain whose states denote the number of tasks attached to each processor of the system [BFG03]. The authors proved that the work stealing scheduler for independent tasks, where each steal is allowed to take up to half of a processor's work, is stable for a long term execution. Unfortunately, their analysis also relies on the assumption that tasks arrive at the system according to a random distribution, and so it is not apt to model the performance of structured computation schedulers.
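The balls-into-bins measure mentioned above (the load of the fullest bin after placing n balls into n bins) can be made concrete with a small simulation. This is only an illustrative sketch: the two strategies compared, a single uniformly random choice and the well-known "two random choices, pick the less loaded bin" rule, are chosen here as examples and are not claimed to be the specific strategies analyzed in any one of the cited works; all function names are hypothetical.

```python
import random


def max_load_one_choice(n, rng):
    """Place n balls into n bins, each into one uniformly random bin."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)


def max_load_two_choices(n, rng):
    """Place n balls into n bins; each ball samples two bins and joins the emptier one."""
    loads = [0] * n
    for _ in range(n):
        a, b = rng.randrange(n), rng.randrange(n)
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    return max(loads)


def average_max_load(strategy, n, trials, seed=0):
    """Average, over independent trials, of the load of the fullest bin."""
    rng = random.Random(seed)
    return sum(strategy(n, rng) for _ in range(trials)) / trials
```

Comparing average_max_load for the two strategies with the same n gives the kind of effectiveness comparison described above: the strategy with the lower expected maximum load is the more effective one. Note that in both cases the balls arrive independently of what the bins have already processed, which is exactly the assumption that does not carry over to structured computations.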
In addition, the authors assume that the number of tasks generated at each round is at most the number of processors, which, taking into account the standard conventions regarding the structure of computations [ABP98, ABP01, BL99, ACR13], is not realistic for modelling schedulers of structured computations.
Although it may not be entirely straightforward, it is possible to use our methodology to model the steal-half work stealing algorithm. To do so, each steal would have to be divided into a sequence of scheduling iterations, such that during each iteration the thief transfers a node from its victim. However, transferring half of a processor's work may take some time, which not only implies that the thief will have to wait until it can begin executing what it stole, but also means that either concurrent steal attempts to the same deque are delayed (to avoid duplicate steals), or thieves have to first transfer the work they intend to steal from their victims and only then attempt to commit the steal. Regarding the latter option, note that if a thief is transferring work from one of the only processors that is generating work, then the steal attempt is likely to fail. Moreover, since during each round a processor can enable two nodes, it would still be possible for the processor whose deque is being stolen from to generate a large amount of work.
Conclusion
We introduced a formal framework for the performance analysis of structured computation schedulers, and defined an appropriate criterion for measuring the performance of online scheduling algorithms: algorithm short-term stability. Moreover, we introduced a simple and powerful method that makes it possible to analyze the performance of these schedulers, and demonstrated its convenience by using it to two different ends: 1. proving that the performance of WS is indeed limited; and 2. analyzing the performance of WSS. Although WSS is a purely theoretical algorithm, its analysis gave us insight into how to possibly overcome the limitation of WS. Nevertheless, the greedy spreading strategy of the algorithm has a severe limitation that makes us question its practical value: even if every processor is busy, whenever a processor generates work it makes a spread attempt. This not only makes processors incur unnecessary overheads (which, for modern computer architectures, are unduly large) but, even more importantly, entails a serious drawback concerning the communication costs of the algorithm. Consequently, it is still an open problem to come up with a practical algorithm that overcomes WS's limitation while maintaining its asymptotically optimal expected execution time and communication costs, and its low space requirements.
Proof of Lemma 2.5. Claim A.1 implies that for any
. Thus, by Definition 2.4 we conclude this lemma holds.
A.2 Full proof for Lemma 2.6 (Round Progression Lemma)
Claim A.2. For any round t and processor p ∈ Procs, R t+1 (p) ∩ M
Proof. For the purpose of contradiction, assume R t+1 (p)∩M
If j were t [L − 1], then S = ∅, and so, as one can deduce,
Since a node that is ready can only become executed, and a node in state executed does not change its state, it follows ∀i ∈ j, . . . , t
Proof of Lemma A.3. By Definition 2.2, it follows
Lemma A.4. For any steps s 0 , s 1 , with s 1 > s 0 , and processor p ∈ Procs,
Proof of Lemma A.4. Prove this lemma by induction.
Base case For the base case, let s 1 = s 0 + 1. Then,
Taking into account Lemma A.3, we conclude the base case holds.
Induction step Assume that the result holds for some s l > s 0 , and show that it also holds for s l + 1. The induction hypothesis is
and prove
Again, taking into account Lemma A.3, it is easy to deduce that the induction step holds, implying the lemma holds.
Lemma A.5. For any round t and processor p ∈ Procs
Proof of Lemma A.5. By Definitions 2.1, 2.2 and 2.4, it follows
Claim A.6. For any round t and processor p ∈ Procs,
Proof. First, for an arbitrary step s 0 prove by induction that for any step s 1 such that s 1 > s 0 ,
To conclude the proof of the base case, note that
and we prove
To conclude this proof, let s 0 = t [0] and
Claim A.7. For any steps s 0 , s 1 such that s 1 > s 0 :
Proof. Prove this claim for an arbitrary s 0 by induction on s 1 .
Base case For the base case, consider s 1 = s 0 + 1. Then
Induction step Assuming the claim holds for s l ≥ s 0 + 1, prove the claim also holds for s l + 1. Since by the induction hypothesis
To conclude,
Claim A.8. For any steps s 0 , s 1 such that s 1 > s 0 ,
To conclude the proof of the base case, note that by Definition 2.2
Induction step Assuming the claim is true for s l ≥ s 0 + 1 show that it holds for s l + 1. Thus, using the induction hypothesis
Lemma A.9. For any round t and processor p ∈ Procs
Proof of Lemma A.9. To prove this direction of the lemma, it suffices to show:
Prove each of these propositions:
1. Claim A.6 implies Proposition 1 holds.
2. To prove Proposition 2:
By Claim A.7, letting s 0 = t [0] and
3. By Claim A.8, letting s 0 = t [0] and
To conclude this proof note that
Proof of Lemma 2.6. Lemmas A.5 and A.9 imply this result.
A.3 Full proof for Lemma 2.13 (Connecting Lemma)
Claim A.10. For any round t and p ∈ Procs, E t (p) ∩ R t (p) = ∅.
Proof. Given an arbitrary step s 0 , we prove by induction on a step s 1 (where
Induction step To prove the induction step, assume the lemma holds for a step s l > s 0 and then prove that it also holds for s l+1 .
To conclude the proof, let s 0 = t [0] and
Claim A.11. For any round t and p ∈ Procs,
Proof. For the purpose of contradiction, let us assume
Thus, at least one of the following propositions has to hold:
To conclude the proof of this claim, we prove that none of the propositions holds, contradicting our hypothesis that
If j were t [0], then S = ∅, and so, j > t [0]. Consider any node µ ∈ S. Since a node that is ready can only become executed, and a node in state executed does not change its state, it follows
(p) = ∅, which, together with Lemma 2.5, contradicts Definition 2.3 -the definition of a round.
Contradiction for Proposition 2 If
Consider any node µ ∈ S.
If a node is not in state ready at step m, then it is either in state not ready or executed. Thus, at step m, µ is either in state not ready or executed. Because at step m + 1 µ is in state ready, and since a node that is in state executed does not change its state, we deduce that µ is in state not ready at step m. Definition 2.1 then implies that until step m (including m), µ has been in state not ready.
By Definition 2.2, a node can only be migrated at some step i if it is ready at step i. Since µ is migrated at step j, then it must be ready at that step, implying m < j. Furthermore, if m = j − 1, then S = ∅, and so, it follows m ∈ t [0] , . . . , j − 2 .
Since a node that is ready can only become executed, and a node in state executed does not change its state, µ ∈ S implies ∀i ∈ {m + 1, . . . , j + 1} , µ ∈ R i . Moreover, as µ ∈ R m+1 (p) ∩ R j (p) and m < j − 1, it follows that there is a step k ∈ {m + 1, . . . , j} such that
which, together with Lemma 2.5, contradicts Definition 2.3 -the definition of rounds.
Claim A.12. For any round t and p ∈ Procs,
. We prove by induction on a step
Induction step To prove the induction step, assume the lemma holds for a step s l > s 0 and then prove that it also holds for s l + 1, where
First, we prove that the greater the number of processors making spread attempts, the smaller the chances that p's spread attempt succeeds.
Lemma C.1. Let spreads (p, α, d) be a function corresponding to the expected number of nodes that p spreads during any round, where the ratio of idle processors of the round is α and the number of processors enabling two nodes is d. Then, spreads (p, α, d) ≥ spreads (p, α, P (1 − α)).
Proof of Lemma C.1. If p targets a processor whose state flag is set to working, then its spread attempt fails. Thus, in this case p would not spread a node, regardless of the number of processors that make a spread attempt. However, if p targets a processor whose state flag is set to idle, then its attempt has a chance to succeed. We now consider the two possible situations:
d = P (1 − α): In this case, spreads (p, α, d) = spreads (p, α, P (1 − α)).
d ≠ P (1 − α): By definition there are P (1 − α) busy processors, implying that d ≤ P (1 − α). Thus, for this case we conclude d < P (1 − α). Now, suppose p targets some processor q whose state flag is set to idle. Thanks to the synchronous environment we have artificially created, and assuming that any call to UniformlyRandomNumber takes the same number of steps, every processor executes the CAS instruction (whose success dictates the success of the spread attempt) at the same step (line 30 of Algorithm 2). Finally, as a consequence of our assumptions regarding the CAS instruction (see the first paragraph of Section 4) and since processors target donees uniformly at random, the greater the number of spread attempts that target q, the smaller the chances for p's spread attempt to succeed, concluding the proof of this lemma.
Lemma C.2. Suppose there are B bins, each of which is painted either red or green, and let B_R and B_G denote the initial number of red and green bins, respectively. Additionally, let α denote the initial ratio of red bins, meaning α = B_R / B and B (1 − α) = B_G. Now, suppose there are B_R cubes and B_G balls. First, each cube is tossed, independently and uniformly at random, into the bins. After tossing all the B_R cubes, count the number of cubes that landed in green bins, and, for each such cube, a red bin is painted green. After finishing all the paintings, each of the B_G balls is tossed, independently and uniformly at random, into the bins. Let Y denote the number of bins that are still red, with at least one ball. Then E[Y] ≥ B α^2 (1 − e^{−(1−α)}).
≤ e^{−(1−α)}.
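The experiment described in this lemma can be simulated directly, which gives a quick numerical sanity check. The sketch below is an illustration only: the function names are hypothetical, the choice of which red bins get repainted is arbitrary (it does not affect the distribution of Y, since balls are tossed uniformly), and the reference value B α^2 (1 − e^{−(1−α)}) is the bound that Lemma C.3 states for processors, read here with B in place of P.

```python
import math
import random


def red_bins_with_a_ball(num_bins, num_red, rng):
    """Run the cube/ball experiment once and return Y.

    Bins 0 .. num_red-1 start red; the remaining bins start green.
    """
    num_green = num_bins - num_red
    # Toss one cube per red bin; count the cubes that land in green bins.
    cubes_in_green = sum(
        1 for _ in range(num_red) if rng.randrange(num_bins) >= num_red
    )
    # One red bin is painted green per such cube (which one is arbitrary).
    still_red = num_red - cubes_in_green   # bins 0 .. still_red-1 remain red
    # Toss one ball per initially green bin; record the red bins that are hit.
    hit = set()
    for _ in range(num_green):
        b = rng.randrange(num_bins)
        if b < still_red:
            hit.add(b)
    return len(hit)


def estimate_expected_y(num_bins, num_red, trials, seed=0):
    """Monte Carlo estimate of E[Y]."""
    rng = random.Random(seed)
    total = sum(red_bins_with_a_ball(num_bins, num_red, rng) for _ in range(trials))
    return total / trials


def reference_bound(num_bins, num_red):
    """B * alpha^2 * (1 - e^{-(1 - alpha)}), with alpha = num_red / num_bins."""
    alpha = num_red / num_bins
    return num_bins * alpha ** 2 * (1.0 - math.exp(-(1.0 - alpha)))
```

For small B the gap between the simulated mean and the reference bound is visible; as B grows the two get close, because (1 − 1/B)^{B(1−α)} approaches e^{−(1−α)}.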
Since the probability that no ball lands in b_i is independent of the number of red bins painted green (i.e. Y_i is independent of B_{R→G}), then, for any m, P{Y_i = 0 | B_{R→G} = m} = P{Y_i = 0}, and P{Y_i = 1 | B_{R→G} = m} = P{Y_i = 1}.
We now obtain lower bounds on the total number of spreads (or donations) made to processors during the second phase of some scheduling iteration, assuming that all busy processors make a spread attempt.
Lemma C.3. Consider any round t during a computation's execution, and let B_t be the set of processors that are busy during t. If ∀p ∈ B_t, |E_t(p)| = 2, then E[|Spread^+_t(B_t)|] ≥ P α_t^2 (1 − e^{−(1−α_t)}).
Proof of Lemma C.3. We prove this result by making an analogy with Lemma C.2: 1. the number of bins B corresponds to the number of processors P; 2. the initial ratios of red and green bins correspond, respectively, to the ratios of idle and busy processors during the round; 3. each cube toss corresponds to a steal attempt; 4. each red bin that is painted green corresponds to a processor that was idle but whose steal attempt succeeded, and thus changed its state flag to working; and 5. each ball toss corresponds to a spread attempt. Note that we can make this analogy because all steal attempts (and consequent state flag updates) take place during the first phase of scheduling iterations, while all spread attempts take place during the second phase of scheduling iterations. Thus, E[|Spread^+_t(B_t)|] ≥ P α_t^2 (1 − e^{−(1−α_t)}).
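For readers who want to see where the factors α^2 and 1 − e^{−(1−α)} come from, the following is a short sketch of the expectation computation for the bins experiment of Lemma C.2. It is our own reconstruction from the lemma's setup, not the original appendix proof, and it assumes that every cube landing in a green bin repaints exactly one red bin.

```latex
\begin{align*}
\mathbb{E}[B_{R\to G}] &= B_R \cdot \frac{B_G}{B} = B\,\alpha(1-\alpha), \\
\mathbb{E}[B_R - B_{R\to G}] &= B\alpha - B\,\alpha(1-\alpha) = B\,\alpha^{2}, \\
\Pr[\text{no ball lands in a fixed bin}] &= \Bigl(1-\tfrac{1}{B}\Bigr)^{B(1-\alpha)} \le e^{-(1-\alpha)}, \\
\mathbb{E}[Y] &= \mathbb{E}[B_R - B_{R\to G}]\,\Bigl(1-\Bigl(1-\tfrac{1}{B}\Bigr)^{B(1-\alpha)}\Bigr)
  \;\ge\; B\,\alpha^{2}\bigl(1-e^{-(1-\alpha)}\bigr).
\end{align*}
```

The last equality uses the fact that the ball tosses are independent of the cube tosses. Replacing B by P and α by α_t gives the expression in Lemma C.3.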
Proof of Lemma 4.6. By Lemma C.1 it follows that E[|Spread + t (p t ) |] is the smallest if all busy processors enabled two nodes, and thus made a spread attempt. By Lemma C.3, the expected number of nodes spread during a round such that all busy processors make a spread attempt is at least
