Well-Structured Futures and Cache Locality by Herlihy, Maurice & Liu, Zhiyu
arXiv:1309.5301v3 [cs.DC] 10 Apr 2017
Well-Structured Futures and Cache Locality
Maurice Herlihy
Computer Science Department
Brown University
mph@cs.brown.edu
Zhiyu Liu
Computer Science Department
Brown University
zhiyu_liu@brown.edu
Abstract
In fork-join parallelism, a sequential program is split into a directed
acyclic graph of tasks linked by directed dependency edges, and the
tasks are executed, possibly in parallel, in an order consistent with
their dependencies. A popular and effective way to extend fork-join
parallelism is to allow threads to create futures. A thread creates a
future to hold the results of a computation, which may or may not
be executed in parallel. That result is returned when some thread
touches that future, blocking if necessary until the result is ready.
Recent research has shown that while futures can, of course,
enhance parallelism in a structured way, they can have a deleterious
effect on cache locality. In the worst case, futures can incur
Ω(PT∞ + tT∞) deviations, which implies Ω(CPT∞ + CtT∞)
additional cache misses, where C is the number of cache lines, P is
the number of processors, t is the number of touches, and T∞ is the
computation span. Since cache locality has a large impact on software
performance on modern multicores, this result is troubling.
In this paper, however, we show that if futures are used in a
simple, disciplined way, then the situation is much better: if each
future is touched only once, either by the thread that created it, or
by a thread to which the future has been passed from the thread
that created it, then parallel executions with work stealing can
incur at most O(CPT∞²) additional cache misses, a substantial
improvement. This structured use of futures is characteristic of
many (but not all) parallel applications.
1. Introduction
Futures [18, 19] are an attractive way to structure many parallel
programs because they are easy to reason about (especially if the
futures have no side-effects) and they lend themselves well to so-
phisticated dynamic scheduling algorithms, such as work-stealing
[11] and its variations, that ensure high processor utilization. At
the same time, however, modern multicore architectures employ
complex multi-level memory hierarchies, and technology trends are
increasing the relative performance differences among the various
levels of memory. As a result, processor utilization can no longer
be the sole figure of merit for schedulers. Instead, the cache locality
of the parallel execution will become increasingly critical to overall
performance, and will increasingly join processor utilization as a
criterion for evaluating dynamic scheduling algorithms.
Several researchers [1, 22] have shown, however, that introduc-
ing parallelism through the use of futures can sometimes substan-
tially reduce cache locality. In the worst case, if we add futures
to a sequential program, a parallel execution managed by a work-
stealing scheduler can incur Ω(PT∞ + tT∞) deviations, which,
as we show, implies Ω(CPT∞ + CtT∞) more cache misses than
the sequential execution. Here, C is the number of cache lines, P
is the number of processors, t is the number of touches, and T∞
is the computation’s span (or critical path). As technology trends
cause the cost of cache misses to increase, this additional cost is
troubling.
This paper makes the following three contributions. First, we
show that if futures are used in a simple, disciplined way, then the
situation with respect to cache locality is much better: if each future
is touched only once, either by the thread that created it, or
by a thread to which the future has been passed directly or indirectly
from the thread that created it, then parallel executions
with work stealing can incur at most O(CPT∞²) additional cache
misses, a substantial improvement over the unstructured case. This
result provides a simple way to identify computations for which
introducing futures will not incur a high cost in cache locality, as
well as providing guidelines for the design of future parallel com-
putations. (Informally, we think these guidelines are natural, and
correspond to structures programmers are likely to use anyway.)
We also prove that this upper bound is tight within a factor of C.
Our second contribution is to observe that when the scheduler has
a choice between running the thread that created a future, and the
thread that implements the future, running the future thread first
provides better cache locality. Finally, we show that certain varia-
tions of structured computation also have good cache locality.
The paper is organized as follows. Section 2 describes the model
for future-parallel computations. In Section 3, we describe parsi-
monious work-stealing schedulers, and briefly discuss their cache
performance measures. In Section 4, we define some restricted
forms of structured future-parallel computations. Among them, we
highlight structured single-touch computations, which, we believe,
are likely to arise naturally in many programs. In Section 5.1,
we prove that work-stealing schedulers on structured single-touch
computations incur only O(CPT∞²) additional cache misses, if a
processor always chooses the future to execute first when it creates
that future. We also prove this bound is tight within a factor of C. In
Section 5.2, we show that if a processor chooses the current thread
over the future thread when it creates that future, then the cache locality
of a structured single-touch computation can be much worse.
In Section 6, we show that some other kinds of structured future-
parallel computations also achieve relatively good cache locality.
Finally, we present conclusions in Section 7.
1 2018/9/14
2. Model
In fork-join parallelism [5, 6, 8], a sequential program is split into
a directed acyclic graph of tasks linked by directed dependency
edges. These tasks are executed in an order consistent with their
dependencies, and tasks unrelated by dependencies can be executed
in parallel. Fork-join parallelism is well-suited to dynamic load-
balancing techniques such as work stealing [1–3, 9, 11–13, 15, 18–
20].
A popular and effective way to extend fork-join parallelism is
to allow threads to create futures [4, 7, 14, 18, 19]. A future is
a data object that represents a promise to deliver the result of an
asynchronous computation when it is ready. That result becomes
available to a thread when the thread touches that future, blocking
if necessary until the result is ready. Futures are attractive because
they provide greater flexibility than fork-join programs, and they
can also be implemented effectively using dynamic load-balancing
techniques such as work stealing. Fork-join parallelism can be
viewed as a special case of future-parallelism, where the spawn
operation is an implicit future creation, and the sync operation is an
implicit touch of the untouched futures created by a thread. Future-
parallelism is more flexible than fork-join parallelism, because the
programmer has finer-grained control over touches (joins).
2.1 Computation DAG
A thread creates a future by marking an expression (usually a
method call) as a future. This statement spawns a new thread to
evaluate that expression in parallel with the thread that created the
future. When a thread needs access to the results of the compu-
tation, it applies a touch operation to the future. If the result is
ready, it is returned by the touch, and otherwise the touching thread
blocks until the result becomes ready. Without loss of generality,
we will consider fork-join parallelism to be a special case of future-
parallelism, where forking a thread creates a future, and joining one
thread to another is a touch operation.
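As a concrete illustration of future creation and touch, the following minimal sketch uses Python's standard `concurrent.futures` module. It is an assumption made for illustration only: the paper does not prescribe a language or runtime, and `expensive` and `main` are hypothetical names. Here `submit` plays the role of marking an expression as a future, and `result()` plays the role of the touch.

```python
from concurrent.futures import ThreadPoolExecutor


def expensive(n):
    # Hypothetical stand-in for the expression marked as a future.
    return sum(i * i for i in range(n))


def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut = pool.submit(expensive, 1000)  # future creation (the fork)
        local = sum(range(1000))            # the creating thread keeps running
        value = fut.result()                # touch: blocks until the result is ready
    return local, value
```

Note that a thread pool is not a work-stealing scheduler; the sketch only shows the create/touch interface, not the scheduling policy analyzed in this paper.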
Our notation and terminology follow earlier work [1, 3, 11, 22].
A future-parallel computation is modeled as a directed acyclic
graph (DAG). Each node in the DAG represents a task (one or
more instructions), and an edge from node u to node v represents
the dependency constraint that u must be executed before v. We
follow the convention that each node in the DAG has in-degree and
out-degree either 1 or 2, except for a distinguished root node with
in-degree 0, where the computation starts, and a distinguished final
node with out-degree 0, where the computation ends.
There are three types of edges:
• continuation edges, which point from one node to the next in
the same thread,
• future edges (sometimes called spawn edges), which point from
node u to the first node of another thread spawned at u by a
future creation,
• touch edges (sometimes called join edges), directed from a node
u in one thread t to a node v in another thread, indicating that v
touches the future computed by t.
A thread is a maximal chain of nodes connected by continuation
edges. There is a distinguished main thread that begins at the root
node and ends at the final node, and every other thread t begins at
a node with an incoming future edge from a node of the thread that
spawns t. The last node of t has only one outgoing edge which is a
touch edge directed to another thread, while other nodes of t may or
may not have incoming and outgoing touch edges. A critical path
of a DAG is a longest directed path in the DAG, and the DAG’s
computation span is the length of a critical path.
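The DAG model above can be sketched in a few lines of code. The edge-list representation, the node names, and the choice to measure path length in nodes rather than edges are all assumptions made for this illustration; the text only says that the span is the length of a longest directed path.

```python
from collections import defaultdict

# Hypothetical example: the main thread r -> a -> x -> f forks a future
# thread b -> c at r, and the future is touched at x.
EDGES = [
    ("r", "b", "future"),
    ("r", "a", "continuation"),
    ("b", "c", "continuation"),
    ("c", "x", "touch"),
    ("a", "x", "continuation"),
    ("x", "f", "continuation"),
]


def span(edges, root):
    """Number of nodes on a longest directed path starting at root."""
    children = defaultdict(list)
    for u, v, _kind in edges:
        children[u].append(v)
    memo = {}

    def longest(u):
        if u not in memo:
            memo[u] = 1 + max((longest(v) for v in children[u]), default=0)
        return memo[u]

    return longest(root)


def count_touches(edges):
    """Number of distinct touch nodes (targets of touch edges)."""
    return len({v for _u, v, kind in edges if kind == "touch"})
```

In this example the critical path is r, b, c, x, f, so the span is 5 under the node-counting assumption, and there is a single touch.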
As illustrated in Figure 1, if a thread t1 spawns a new thread t2
at node v in t1 (i.e., v has two outgoing edges, a continuation edge
and a future edge to the first node of t2), then we call t1 the parent
thread of t2, t2 the future thread (of t1) at v, and v the fork of t2.
A thread t3 is a descendant thread of t1 if t3 is a future thread of
t1 or, by induction, t3’s parent thread is a descendant thread of t1.
Figure 1. Node and thread terminology (panels (a) and (b))
If there is a touch edge directed from node v1 in thread t1 to
node v2 in thread t2 (i.e., t2 touches a future computed by t1), and
a continuation edge directed from node u2 in t2 to v2, then we call
node v2 a touch of t1 by t2, v1 the future parent of v2, u2 the
local parent of v2, and t1 the future thread of v2. (Note that the
touch v2 is actually a node in thread t2.) We call the fork of t1 the
corresponding fork of v2.
Note that only touch nodes have in-degree 2. To distinguish
between the two types of nodes with out-degree 2, forks and future
parents of touches, we follow the convention of previous work that
the children of a fork both have in-degree 1 and cannot be touches.
In this way, a fork node has two children with in-degree 1, while a
touch’s future parent has a (touch) child with in-degree 2.
We follow the convention that when a fork appears in a DAG,
the future thread is shown on the left, and the parent thread on the
right. (Note that this does not mean the future thread is chosen to
execute first at a fork.) Similarly, the future parent of a touch is
shown on the left, and the local parent on the right.
We use the following (standard) notation. Given a computation
DAG, P is the number of processors executing the computation, t
is the number of touches in the DAG, T∞, the computation span
(or critical path), is the length of the longest directed path, and C
is the number of cache lines in each processor.
3. Work-Stealing and Cache Locality
In this paper, we focus on parsimonious work stealing algorithms
[3], which have been extensively studied [1, 3, 10, 11, 22] and used
in systems such as Cilk [9]. In a parsimonious work stealing algo-
rithm, each processor is assigned a double-ended queue (deque).
After a processor executes a node with out-degree 1, it continues
to execute the next node if the next node is ready to execute. After
the processor executes a fork, it pushes one child of the fork onto
the bottom of its deque and executes the other. When the processor
runs out of nodes to execute, it pops the first node from the bot-
tom of its deque if the deque is not empty. If, however, its deque
is empty, it steals a node from the top of the deque of an arbitrary
processor.
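The deque discipline just described can be sketched as follows. This is a minimal single-threaded model, not a concurrent implementation: the `Worker` class and `next_node` helper are hypothetical names, and the victim is chosen deterministically (first non-empty deque) where the text allows an arbitrary processor.

```python
from collections import deque


class Worker:
    """One processor with its double-ended queue of ready nodes."""

    def __init__(self):
        self.deque = deque()  # left end is the top, right end is the bottom

    def push_bottom(self, node):
        self.deque.append(node)

    def pop_bottom(self):
        return self.deque.pop() if self.deque else None

    def steal_top(self):
        return self.deque.popleft() if self.deque else None


def next_node(me, others):
    """Pick the next node for `me`: own bottom first, otherwise steal a top."""
    node = me.pop_bottom()
    if node is not None:
        return node
    for victim in others:  # assumption: first non-empty victim, not a random one
        node = victim.steal_top()
        if node is not None:
            return node
    return None
```

The owner works at the bottom of its deque while thieves take from the top, so a thief always removes the node that has been waiting longest.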
In our model, a cache is fully associative and consists of mul-
tiple cache lines, each of which holds the data in a memory block.
Each instruction can access only one memory block. In our anal-
ysis we focus only on the widely-used least-recently used (LRU)
cache replacement policy, but our results about the upper bounds
on cache overheads should apply to all simple cache replacement
policies [1]. 1
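The cache model can be made concrete with a small miss counter. This is a sketch under the stated model (fully associative, LRU replacement, one memory block per instruction); the function name `lru_misses` and the representation of an execution as a list of block identifiers are assumptions for illustration.

```python
from collections import OrderedDict


def lru_misses(blocks, num_lines):
    """Count misses of a fully associative LRU cache with num_lines lines."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in blocks:
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used block
    return misses
```

Comparing this count for the sequential order against the counts for each processor's order in a parallel execution gives the number of additional cache misses discussed next.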
The cache locality of an execution is measured by the number of
cache misses it incurs, which depends on the structure of the com-
putation. To measure the effect on cache locality of parallelism,
it is common to compare cache misses encountered in a sequen-
tial execution to the cache misses encountered in various parallel
executions, focusing on the number of additional cache misses in-
troduced by parallelism.
Scheduling choices at forks affect the cache locality of execu-
tions with work stealing. After executing a fork, a processor picks
one of the two child nodes to execute and pushes the other into its
deque. For a sequential execution, whether a choice results in a bet-
ter cache performance is a characteristic of the computation itself.
For a parallel execution of a computation satisfying certain prop-
erties, however, we will show that choosing future threads (the left
children) at forks to execute first guarantees a relatively good upper
bound on the number of additional cache misses, compared to a se-
quential execution that also chooses future threads first. In contrast,
choosing the parent threads (the right children) to execute first can
result in a large number of additional cache misses, compared to a
sequential execution that also chooses parent threads first.
4. Structured Computations
Consider a sequential execution where node v1 is executed immediately
before node v2. A deviation [22], also called a drifted node
[1], occurs in a parallel execution if a processor p executes v2, but
not immediately after v1. For example, p might execute v1 after v2,
it might execute other nodes between v1 and v2, or v1 and v2 might
be executed by distinct processors.
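The definition of a deviation translates directly into a check over execution traces. The sketch below assumes, for illustration, that the sequential execution is given as a list of nodes and the parallel execution as one list of nodes per processor; `count_deviations` is a hypothetical helper name.

```python
def count_deviations(sequential, per_processor):
    """Return the nodes that are deviations in the parallel execution.

    sequential: node order of the sequential execution.
    per_processor: one list per processor, the nodes it executed, in order.
    """
    # Sequential predecessor of every node except the first.
    pred = {v: u for u, v in zip(sequential, sequential[1:])}
    deviations = []
    for order in per_processor:
        prev = None
        for node in order:
            # A deviation: this processor did not run pred[node] just before.
            if node in pred and pred[node] != prev:
                deviations.append(node)
            prev = node
    return deviations
```

For instance, with sequential order a, b, c, d, e and processors running [a, b, e] and [c, d], the deviations are e (its predecessor d ran elsewhere) and c (its predecessor b ran elsewhere).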
[22] showed that a parallel execution of a future-parallel com-
putation with work stealing can incur Ω(PT∞ + tT∞) deviations.
This implies a parallel execution of a future-parallel computation
with work stealing can incur Ω(PT∞ + tT∞) additional cache
misses. With minor modifications in that computation (see Fig-
ure 2), a parallel execution can even incur Ω(CPT∞ + CtT∞)
additional cache misses.
Our contribution in this paper is based on the observation that
such poor cache locality occurs primarily when futures in the DAG
can be touched by arbitrary threads, resulting in unrealistic and
complicated dependencies. For example, in the worst-case DAGs
in [22] that can incur significantly high cache overheads, futures
are touched by threads that can be created before the future threads
computing these futures were created. As illustrated in Figure 3, a
parallel execution of such a computation can arrive at a scenario
where a thread touches a future before the future thread computing
that future has been spawned. (As a practical matter, an implementation
must ensure that such a touch does not return a reference
to a memory location that has not yet been allocated.) Such scenarios
are avoided by structured future-parallel computations (e.g.,
Figure 4) that follow certain simple restrictions.
1 That is because the upper bounds in this paper are based on the results
of [1] that bound the number of drifted nodes (i.e., deviations), and those
results hold for all simple cache replacement policies, even with set-associative
caches, as discussed in [1].
Figure 2. The interesting part of the bound is Ω(CtT∞). Figure 5
in [22] shows a DAG, as a building block of a worst-case computation,
that can incur Ω(T∞) deviations because of one touch. We
can replace it with the DAG in Figure 2, which can incur Ω(CT∞)
additional cache misses due to one touch v (if the processor at
a fork always chooses the parent thread to execute first), so that
the worst-case computation in [22] can incur Ω(CtT∞) additional
cache misses because of t such touches. This DAG is similar to the
DAG in Figure 7(a) in this paper. The proof of Theorem 10 shows
how a parallel execution of this DAG incurs Ω(CT∞) additional
cache misses.
Figure 3. A simplified version of the DAG in [22] that can incur
high cache overhead. Here, v1 and v2 are touches. Suppose a processor
p1 executes the root node, pushes the right child x of the root
node into its deque, and then falls asleep. Now another processor
p2 steals x from p1’s deque and executes the subgraph rooted at x.
Thus, v1 and v2 will be checked (to see if they are available) even
before the corresponding future threads are spawned at u1 and u2.
DEFINITION 1. A DAG is a structured future-parallel computation
if, (1) for the future thread t of any fork v, the local parents of the
touches of t are descendants of v, and (2) at least one touch of t is
a descendant of the right child of v.
Figure 4. In this structured (single-touch) computation, the
touches v1 and v2 will not be checked until their corresponding
future threads have been spawned at u1 and u2, respectively.
There are two reasons we require that at least one touch of t is
a descendant of the right child of v. First, it is natural that a
computation spawns a future thread to compute a future because
the computation itself later needs that value. At the fork v, the
parent thread (the right child of v) represents the “main body” of
the computation. Hence, the future will usually be touched either
by the parent thread, or by threads spawned directly or indirectly
by the parent thread.
Second, a computation usually needs a kind of “barrier” syn-
chronization to deal with resource release at the end of the com-
putation. Some node in the future thread t, usually the last node,
should have an outgoing edge pointing to the “main body” of the
computation to tell the main body that the future thread has fin-
ished. Without such synchronization, t and its descendants will be
isolated from the main body of the computation, and we can imag-
ine a dangerous scenario where the main body of the computation
finishes and releases its resources while t or its descendant threads
are still running.
In our DAG model, such a synchronization point is by definition
a touch node, though it may not be a real touch. We follow the convention
that the thread that spawns a future thread releases it, so the
synchronization point is a node in the parent thread or one of its descendants.
Another possibility is to place the synchronization point
at the last node of the entire computation, which is typically the
case in languages such as Java, where the main thread of a program
is in charge of releasing resources for the entire computation.
These two styles are essentially equivalent, and should have almost
the same bounds on cache overheads. We will briefly discuss this
issue in Section 6.2.
We consider how the following constraint affects cache locality.
DEFINITION 2. A structured single-touch computation is a struc-
tured future-parallel computation where each future thread spawned
at a fork v is touched only once, and the touch node is a descendant
of v’s right child.
By the definition of threads, the future parent of the only touch of
a future thread must be the last node of the future thread (the last
node can also be a parent of a join node, but we don’t distinguish
between a touch node and a join node). The DAG in Figure 4 represents
a structured single-touch computation. We will show that
work-stealing parallel executions of structured single-touch computations
achieve significantly lower cache overheads than unstructured
computations.
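The single-touch condition of Definition 2 can be checked mechanically on a small DAG. The sketch below is only an illustration: the edge-list encoding, the node names, and the helper `is_structured_single_touch` are assumptions, and the check covers exactly the condition stated in Definition 2 (one touch per future thread, reachable from the right child of its fork).

```python
from collections import defaultdict


def is_structured_single_touch(edges):
    """edges: (u, v, kind), kind in {'continuation', 'future', 'touch'}."""
    succ = defaultdict(list)
    cont = {}  # continuation successor of each node, if any
    for u, v, kind in edges:
        succ[u].append(v)
        if kind == "continuation":
            cont[u] = v

    def reachable(src):
        # src itself is included; by convention a fork's child is never a touch.
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(succ[n])
        return seen

    for fork, first, kind in edges:
        if kind != "future":
            continue
        # Nodes of the future thread spawned at this fork.
        thread, node = {first}, first
        while node in cont:
            node = cont[node]
            thread.add(node)
        touches = [v for u, v, k in edges if k == "touch" and u in thread]
        right_child = cont.get(fork)
        if len(touches) != 1 or right_child is None:
            return False
        if touches[0] not in reachable(right_child):
            return False
    return True


# Hypothetical structured example: the future forked at r is touched at x,
# which descends from r's right child a.
GOOD = [("r", "b", "future"), ("r", "a", "continuation"),
        ("b", "c", "continuation"), ("c", "x", "touch"),
        ("a", "x", "continuation"), ("x", "f", "continuation")]

# Hypothetical unstructured example: the future forked at u is touched at x,
# a node of a thread that was spawned earlier, at r.
BAD = [("r", "p", "future"), ("r", "a", "continuation"),
       ("p", "x", "continuation"), ("x", "q", "continuation"),
       ("q", "z", "touch"), ("a", "u", "continuation"),
       ("u", "g", "future"), ("u", "w", "continuation"),
       ("g", "h", "continuation"), ("h", "x", "touch"),
       ("w", "z", "continuation"), ("z", "f", "continuation")]
```

In the second example the touch x is not a descendant of the right child w of the fork u, which mirrors the kind of dependency that Figure 3 describes.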
In principle, a future could be touched multiple times by different
threads, so structured single-touch computations are more
restrictive than structured computations in general. Nevertheless, the
single-touch constraint is one that is likely to be observed by many
programs. For example, as noted, the Cilk [9] language supports
fork-join parallelism, a strict subset of the future-parallelism model
considered here. If we interpret the Cilk language’s spawn statement
as creating a future, and its sync statement as touching
all untouched futures previously created by that thread, then Cilk
programs (like all fork-join programs) are structured single-touch
computations.
void MethodA {
    Future x = some computation;
    Future y = some computation;
    a = y.touch();
    b = x.touch();
}
(a)
void MethodB {
    Future x = some computation;
    Future y = MethodC(x);
    ......
}
void MethodC(Future f){
    a = f.touch();
    ......
}
(b)
Figure 5. Two examples illustrating that single-touch computations are
more flexible than fork-join computations
Structured single-touch computations encompass fork-join com-
putations, but are strictly more flexible. Figure 5 presents two ex-
amples that illustrate the differences. If a thread creates multiple
futures first and touches them later, fork-join parallelism requires
they be touched (evaluated) in the reverse order. MethodA in Fig-
ure 5(a) shows the only order in which a thread can first create two
futures and then touch them in a fork-join computation. This rules
out, for instance, a program where a thread creates a sequence of
futures, stores them in a priority queue, and evaluates them in some
priority order. In contrast, our structured computations permit such
futures to be evaluated by their creating thread or its descendants
in any order.
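The priority-queue scenario mentioned above can be sketched with Python's standard `heapq` and `concurrent.futures` modules. This is an illustrative assumption, not code from the paper: `square` and `touch_in_priority_order` are hypothetical names, and a thread pool stands in for a future-parallel runtime.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def square(n):
    # Hypothetical stand-in for the computation behind each future.
    return n * n


def touch_in_priority_order(jobs):
    """jobs: list of (priority, argument); lower priority is touched first."""
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        heap = []
        for seq, (priority, arg) in enumerate(jobs):
            fut = pool.submit(square, arg)  # create the futures up front
            # seq breaks ties so that Future objects are never compared.
            heapq.heappush(heap, (priority, seq, fut))
        while heap:
            _priority, _seq, fut = heapq.heappop(heap)
            results.append(fut.result())    # each future is touched exactly once
    return results
```

The creating thread touches every future exactly once, but in an order unrelated to creation order, which a fork-join program could not express.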
Also, unlike fork-join parallelism, our notion of structured com-
putation permits a thread to pass a future to another thread which
touches that future, as illustrated in Figure 5(b): after a future is cre-
ated, the future can be passed, as an argument of a new method call
or the return value of the current thread’s method call, to another
thread.2 The thread receiving the future (MethodC in the figure) can
even pass it to another thread, and so on. The only constraint is that
only one of the threads that have received the future can touch it. In
a fork-join computation, however, only the thread creating the fu-
ture can touch it, which is much more restrictive. We believe these
restrictions are easy to follow and should be compatible with how
many people program in practice.
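The pattern of Figure 5(b), where a future is handed to another thread that performs the single touch, might look as follows in Python. The names `compute_x`, `method_b`, `method_c`, and `run` are hypothetical, and the thread pool is an assumption standing in for a future-parallel runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_x():
    # Hypothetical computation behind future x.
    return 21


def method_c(f):
    a = f.result()  # the only touch of x, performed by the receiving thread
    return 2 * a


def method_b(pool):
    x = pool.submit(compute_x)    # create future x
    y = pool.submit(method_c, x)  # pass x as an argument; method_b never touches x
    return y.result()             # the creator touches y exactly once


def run():
    # Two workers, so method_c can block on x while compute_x still makes progress.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return method_b(pool)
```

Every future here is touched once, and each touch happens in the creating thread or in a thread that received the future from it.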
[7] observe that if a future can be touched multiple times, then
complex and potentially inefficient operations and data structures
are needed to correctly resume the suspended threads that are waiting
for the touch. By contrast, the run-time support for futures can
be significantly simplified if each future is touched at most once.
2 In previous versions of this paper, we misinterpreted the definition of
structured single-touch computations by stating that a future can only be
passed as an argument of a method call, but not as the return value of a
method call. In fact, the definition only implies a dependency path from the
right child of the fork of a future, which represents the parent thread that
has created the future, to the touch node, and therefore it does not rule out
the possibility that a future can be passed as a return value from one thread to
another.
We also consider the following structured local-touch computa-
tions in the paper.
DEFINITION 3. A structured local-touch computation is one where
each future thread spawned at a fork v is touched only at nodes
in its parent thread, and these touches are descendants of the right
child of v.
Informally, the local touch constraint implies that a thread that
needs the value of a future should create the future itself. Note that
in a structured computation with local touch constraint, a future
thread is now allowed to evaluate multiple futures and these futures
can be touched at different times. Though allowing a future thread
to compute multiple futures is not very common, [7] point out that
it can be useful for some future-parallel computations like pipeline
parallelism [7, 9, 16, 17, 21]. We will show in Section 6.1 that
work-stealing parallel executions of computations satisfying the
local touch constraint also have relatively low cache overheads.
Note that structured computations with both single touch and local
touch constraints are still a superset of fork-join computations.
5. Structured Single-Touch Computations
5.1 Future Thread First at Each Fork
We now analyze cache performance of work stealing on parallel
executions of structured single-touch computations. We will show
that work stealing has relatively low cache overhead if the processor
at a fork always chooses the future thread to execute first, and
puts the parent thread into its deque. For brevity, all the arguments
and results in this section assume that every execution chooses the
future thread at a fork to execute first.
LEMMA 4. In the sequential execution of a structured single-touch
computation, any touch x’s future parent is executed before x’s lo-
cal parent, and the right child of x’s corresponding fork v immedi-
ately follows x’s future parent.
Proof. By induction. Given a DAG, initially let S be an empty set
and T the set of all touches. Note that
S ∩ T = ∅ and S ∪ T = {all touches}. (1)
Consider any touch x in T , such that x has no ancestors in T .
(That is, x has no ancestor nodes that are also touches.) Let t be
the future thread of x and v the corresponding fork. Note that x’s
future parent is the last node of t by definition. When the single
processor executes v, the processor pushes v’s right child into the
deque and continues to execute thread t. By hypothesis, there are no
touches by t, since any touch by t must be an ancestor of x. There
may be some forks in t. However, whenever the single processor
executes a fork in t, it pushes the right child of that fork, which is
a node in t, into the deque and hence t (i.e., a node in t) is right
below v’s right child in the deque. Therefore, the processor will
always resume thread t before the right child of v. Since there is
no touch by t, all the nodes in t are ready to execute one by one.
Thus, when the future parent of the touch x is executed eventually,
the right child of v is right at the bottom of the deque. By the single
touch constraint, the local parent of x is a descendant of the right
child of v, so the local parent of x cannot be executed yet. Thus,
the processor will now pop the right child of v out from the bottom
of the deque. Since this node is not a touch, it is ready to execute.
Therefore, x satisfies the following two properties.
PROPERTY 5. Its future parent is executed before its local parent.
PROPERTY 6. The right child of its corresponding fork immedi-
ately follows its future parent.
Now set S = S ∪ {x} and T = T − {x}. Thus, all touches in S
satisfy Properties 5 and 6. Note that Equation 1 still holds.
Now suppose that at some point all nodes in S satisfy Proper-
ties 5 and 6, and that Equation 1 holds. Again, we now consider a
touch x in T , such that no touches in T are ancestors of x, i.e., all
the touches that are ancestors of x are in S. Since the computation
graph is a DAG, there must be such an x as long as T is not empty.
Let t be the future thread of x and v the corresponding fork. If
there are no touches by t, then we can prove x satisfies Properties 5
and 6, by the same argument for the first touch added into S. Now
assume there are touches by t. Since those touches are ancestors of
x, they are all in S and hence they all satisfy Property 5. When the
processor executes v, it pushes v’s right child into the deque and
starts executing t. Similar to what we showed above, when the pro-
cessor gets to a fork in t, it will always push t into its deque, right
below the right child of v. Thus, the processor will always resume
t before the right child of v. When the processor gets to the local
parent of a touch by t, we know the future parent of the touch has
already been executed since the touch satisfies Property 5. Thus,
the processor can immediately execute that touch and continue to
execute t. Therefore, the processor will eventually execute the future
parent of x while the right child of v is still the next node to pop
in the deque. Again, since the local parent of x is a descendant of
the right child of v, the local parent of x as well as x cannot be executed
yet. Therefore, the processor will now pop the right child of
v to execute, and hence x satisfies Properties 5 and 6. Now we set
S = S ∪ {x} and T = T − {x}. Therefore, all touches in S satisfy
Properties 5 and 6, and Equation 1 also holds. By induction, we
have S = {all touches} and all touches satisfy Properties 5 and 6.
⊓⊔
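Lemma 4 can be observed on a small example by simulating the single-processor, future-first execution. The sketch below is an illustration under stated assumptions: the edge-list encoding and node names are hypothetical, and the simulator relies on the single-touch property that every non-fork node has at most one outgoing edge.

```python
from collections import defaultdict

# Hypothetical DAG: r forks future thread b -> c; x is the touch, with
# future parent c and local parent a (the right child of the fork r).
EXAMPLE = [("r", "b", "future"), ("r", "a", "continuation"),
           ("b", "c", "continuation"), ("c", "x", "touch"),
           ("a", "x", "continuation"), ("x", "f", "continuation")]


def sequential_future_first(edges, root):
    """Node order of a one-processor execution that runs futures first."""
    parents = defaultdict(list)
    future_child, other_child = {}, {}
    for u, v, kind in edges:
        parents[v].append(u)
        if kind == "future":
            future_child[u] = v
        else:
            other_child[u] = v  # continuation or touch edge (at most one)
    order, done = [], set()
    pending = []                # end of the list is the bottom of the deque
    node = root
    while node is not None:
        order.append(node)
        done.add(node)
        if node in future_child:
            pending.append(other_child[node])  # push the right child
            node = future_child[node]          # run the future thread first
            continue
        nxt = other_child.get(node)
        if nxt is not None and all(p in done for p in parents[nxt]):
            node = nxt                         # the successor is ready
        else:
            node = pending.pop() if pending else None  # pop from the bottom
    return order
```

On this example the order is r, b, c, a, x, f: the future parent c runs before the local parent a (Property 5), and the right child a of the fork immediately follows c (Property 6).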
[1] have shown that the number of additional cache misses in
a work-stealing parallel computation is bounded by the product of
the number of deviations and the number of cache lines. It is easy
to see that only two types of nodes in a DAG can be deviations:
the touches and the child nodes of forks that are not chosen to
execute first. Since we assume the future thread (left child) at a
fork is always executed first, only the right children of forks can be
deviations. Next, we bound the number of deviations incurred by a
work-stealing parallel execution to bound its cache overhead.
LEMMA 7. Let t be the future thread at a fork v in a structured
single-touch computation. If t’s touch x or v’s right child u is a
deviation, then either u is stolen or there is a touch by t which is a
deviation.
Proof. By Lemma 4, a touch is a deviation if and only if its local
parent is executed before its future parent. Now suppose a processor
p executes v and pushes u into its deque. Assume that u is not
stolen and no touches by t are deviations. Thus, u will stay in p’s
deque until p pops it out. The proof of this lemma is similar to that
of Lemma 4. After p spawns thread t at v, it moves to execute t.
When p executes “ordinary” nodes in t, no nodes are pushed into
or popped out of p’s deque and hence u is still the next node in
the deque to pop. When p executes a fork in t, it pushes t (more
specifically, the right child of that fork) into its deque, right below
u. Since a thief processor always steals from the top of a deque,
and by hypothesis u is not stolen, t cannot be stolen. Thus, p will
always resume t before u and then u will become the next node in
the deque to pop. When p executes the local parent of a touch by
t, the future parent of that touch must have been executed, since
we assume that touch is not a deviation. Thus, p can continue to
execute that touch immediately and keep moving on in t with its
deque unchanged. Therefore, p will finally get to the future parent
of x and then pop u out from its deque, since x is a descendant
of u and x cannot be executed yet. Hence, neither x nor u can be a
deviation. ⊓⊔
THEOREM 8. If, at each fork, the future thread is chosen to execute
first, then a parallel execution with work stealing incurs O(PT∞²)
deviations and O(CPT∞²) additional cache misses in expectation
on a structured single-touch computation, where (as usual) P is
the number of processors involved in this computation, T∞ is the
computation span, and C is the number of cache lines.
Proof. [3] have shown that in a parallel execution with work
stealing, there are in expectation O(PT∞) steals. Now let us count
how many deviations these steals can incur. A steal on the right
child u of a fork v can make u and v’s corresponding touch x1
deviations. Suppose x1 is a touch by a thread t2, then the right
child of the fork of t2 and t2’s touch x2 can be deviations. If x2 is a
deviation and x2 is a touch by another thread t3, then the right child
of the fork of t3 and t3’s touch x3 can be deviation too. Note that
x2 is a descendant of x1 and x3 is a descendant of x2. By repeating
this observation, we can find a chain of touches x1, x2, x3, ..., xn,
called a deviation chain, such that each xi and the right child of the
corresponding fork of xi can be deviations. Since for each i > 1,
xi is a descendant of xi−1, the touches x1, x2, x3, ..., xn all lie on a directed path
in the computation DAG. Since the length of any path is at most
T∞, we have n ≤ T∞. Since each future thread has only one
touch, there is only one deviation chain for a steal. Since there are
O(PT∞) steals in expectation in a parallel execution [3], we can
find in expectation O(PT∞) deviation chains and in total $O(PT_\infty^2)$
touches and right children of the corresponding forks involved, i.e.,
$O(PT_\infty^2)$ deviations.
Next, we prove by contradiction that no other touches or right
children of forks can be deviations. Suppose there is a touch y such
that y or the right child of the corresponding fork of y is a deviation,
and that y is not in any deviation chain. The right child of the
corresponding fork of y cannot be stolen, since by hypothesis y
is not the first touch in any of those chains. Thus, by Lemma 7,
there is a touch y′ by the future thread of y such that y′ is a deviation.
Note that y′ cannot be in any deviation chain either. Otherwise, y
together with the deviation chain containing y′ would form a deviation chain too, a
contradiction. Therefore, by repeating this “tracing back”, we will
end up at a deviation touch that is not in any deviation chain and has
no touches as its ancestors. Therefore, there are no touches by the
future thread of this touch, and the right child of the corresponding
future fork of it is not stolen, contradicting Lemma 7.
The upper bound on the expected number of additional cache
misses follows from the result of [1] that the number of additional
cache misses in a work-stealing parallel computation is bounded by
the product of the number of deviations and the number of cache
lines. ⊓⊔
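The counting argument in this proof can be summarized as a short chain of bounds. The block below only restates the steps above; n denotes the length of one deviation chain, and the factor 2 accounts for each touch xi together with the right child of its corresponding fork.

```latex
\begin{align*}
\mathbb{E}[\text{steals}] &= O(PT_\infty) && \text{(from [3])} \\
\text{deviations per chain} &\le 2n \le 2T_\infty && \text{(the chain lies on one directed path)} \\
\mathbb{E}[\text{deviations}] &\le 2T_\infty \cdot O(PT_\infty) = O(PT_\infty^2) \\
\mathbb{E}[\text{additional cache misses}] &\le C \cdot O(PT_\infty^2) = O(CPT_\infty^2) && \text{(from [1])}
\end{align*}
```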
The bound on the number of deviations in Theorem 8 is tight,
and the bound on the number of additional cache misses is tight
within a factor of C, as shown below in Theorem 9.
THEOREM 9. If, at each fork node, the future thread is chosen
to execute first, then a parallel execution with work stealing can
incur $\Omega(PT_\infty^2)$ deviations and $\Omega(PT_\infty^2)$ additional cache misses
on a structured single-touch computation, while the sequential
execution of this computation incurs $O(PT_\infty^2/C)$ cache misses.
Proof. Figure 6(c) shows a computation DAG on which we can
get the bounds we want to prove. The DAG in Figure 6(c) uses
the DAGs in Figures 6(a) and 6(b) as building blocks. Let us look
at Figure 6(a) first. Suppose there are two processors p1 and p2
executing the DAG in Figure 6(a). Suppose p2 executes v, pushes
u1 into its deque, and then falls asleep before executing w. Now
Figure 6. Panel (c) shows a DAG on which work stealing can
incur $\Omega(PT_\infty^2)$ deviations and $\Omega(PT_\infty^2)$ additional cache misses.
It uses the DAGs in panels (a) and (b) as building blocks.
suppose p1 steals u1. For each i ≤ k, neither si nor Zi can be
executed since w has not been executed yet. Now p1 takes a solo
run, executing u1, x1, Y1, u2, x2, Y2, ..., xk, Yk. After p1 finishes,
p2 wakes up and executes the rest of the computation DAG. Note
that the right (local) parent of si is executed before the left (future)
parent of the touch is executed. Thus, by Lemma 4, each si is a
deviation. Hence, this parallel execution incurs k deviations and
the computation span of the computation is Θ(k).
Now let us consider a parallel execution of the computation
in Figure 6(b). For each i ≤ k, the subgraph rooted at vi is identical
to the computation DAG in Figure 6(a) (except that the last node of the
subgraph has an extra edge pointing to a node of the main thread).
Suppose there are three processors p1, p2, and p3 working on the
computation. Assume p2 executes r1 and v1 and then falls asleep
when it is about to execute w. p3 now steals r2 from p2 and then
falls asleep too. Then p1 steals u1 from p2’s deque. Now p1 and p2
execute the subgraph rooted at v1 in the same way they execute the
DAG in Figure 6(a). After p1 and p2 finish, p3 wakes up and executes r2. Now
these three processors start working on the subgraph rooted at r3 in
the same way they executed the graph rooted at r1. By repeating
this, the execution ends up incurring $k^2$ deviations when all the k
subgraphs are done. Since the length of the path r1, r2, r3... on the
right-hand side is Θ(k), the computation span of the DAG is still
Θ(k).
Now we construct the final computation DAG, as in Figure 6(c).
The “top” nodes of the DAG are all forks, each spawning a future
thread. Thus, they form a binary tree and the number of threads
increases exponentially. The DAG stops creating new threads at
level Θ(log n) when it has n threads rooted at S1, S2, ..., Sn,
respectively. For each i, the subgraph rooted at Si is identical to
the DAG in Figure 6(b). Suppose there are 3n processors working on
the computation. It is easy to see that n processors can eventually get
to S1, S2, ..., Sn. Suppose they all fall asleep immediately after
executing the first two nodes of Si (corresponding to r1 and v1 in
Figure 6(b)), and then the remaining 2n free processors join in pairs to
work on the subgraphs rooted at the Si, in the same way p1, p2 and p3
did in Figure 6(b). Therefore, this execution will finally incur $nk^2$
deviations, while the computation span of the DAG is Θ(k + log n).
Therefore, by setting n = P/3, we get a parallel execution that
incurs $\Omega(PT_\infty^2)$ deviations, when log P = O(k).
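The substitution behind this last step can be written out as follows. This is only a restatement of the quantities given in the paragraph above (n copies of the Figure 6(b) DAG, k² deviations per copy), not an additional result.

```latex
\begin{align*}
\text{deviations} &= n k^2, \qquad T_\infty = \Theta(k + \log n) \\
n = P/3,\ \log P = O(k) &\implies T_\infty = \Theta(k) \\
&\implies n k^2 = \tfrac{P}{3}\,k^2 = \Theta(P T_\infty^2) = \Omega(P T_\infty^2)
\end{align*}
```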
To get the bound on the number of additional cache misses,
we just need to modify the graph in 6(a) as follows. For each
1 ≤ i ≤ k, Yi consists of a chain of C nodes yi1, yi2, ..., yiC,
where C is the number of cache lines. Nodes yi1, yi2, ..., yiC access
memory blocks m1, m2, ..., mC, respectively. Similarly, each Zi
consists of a chain of C nodes zi1, zi2, ..., ziC, which access
memory blocks mC, mC−1, ..., m1, respectively. All si access
memory block mC. For all 1 ≤ i ≤ k, ui and xi both access
memory block mC+1. It does not matter which memory blocks
the other nodes in the DAG access. For simplicity, assume the
other nodes do not access memory. In the sequential execution, the
single processor has m1, m2, ..., mC in its cache after executing
v, w, u1, x1, Y1, Z1 and it has incurred C + 1 cache misses so
far. Now it executes u2 and x2, incurring one cache miss at node
u2 by replacing mC with mC+1 in its cache, since mC is the least
recently used block. When it executes Y2 and Z2, it only incurs
one cache miss by replacing mC+1 with mC at the last node of
Y2, y2C. Likewise, it is easy to see that the sequential execution
will only incur cache misses at nodes ui and at the last nodes of
Yi for all i. Hence, the sequential execution incurs only O(k + C)
cache misses. When k = Ω(C), the sequential execution incurs
only O(k) cache misses.
Now consider the parallel execution by two processors p1 and
p2 that we described before. p2 will incur only C cache misses, since
Zi and si only access C different blocks m1, m2, ..., mC and
hence p2 does not need to swap any memory blocks out of its
cache. However, p1 will incur many cache misses. After executing
each Yi, p1 will execute ui+1. Thus at ui+1, one cache miss is
incurred and m1 is replaced with mC+1, since m1 is the least
recently used block. Then, when p1 executes the first node y(i+1)1
in Yi+1, m1 is not in its cache. Since m2 now becomes the least
recently used memory block in p1’s cache, m2 is replaced by m1.
Thus, m2 will not be in the cache when it is needed at y(i+1)2.
Continuing in this way, p1 incurs a cache miss at each
node in every Yi and hence incurs Ck cache misses in total in the entire
execution. Note that the computation span of this modified DAG is
Θ(Ck), since each Zi now has C nodes. Therefore, the sequential
execution and the parallel execution actually incur Θ(T∞/C) and
Θ(T∞), respectively, when logP = O(k). Therefore, if we use
this modified DAG as the building blocks in 6(c), we will get
the bound on the number of additional cache misses stated in the
theorem. ⊓⊔
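The cache behavior on the modified Figure 6(a) DAG can be checked with a small LRU simulation. The sketch below is illustrative only: it models a fully associative LRU cache with C lines and replays just the block accesses named in the proof. The function names are hypothetical, the sequential trace follows the order u_i, x_i, Y_i, Z_i used in the text (the s_i accesses are left out, as in the text's count), and p1's trace is its solo run u_i, x_i, Y_i.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count the misses of a fully associative LRU cache on a block trace."""
    cache = OrderedDict()            # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)         # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict the least recently used block
            cache[block] = True
    return misses

def sequential_trace(C, k):
    """Blocks touched by u_i, x_i, Y_i, Z_i for i = 1..k (s_i omitted: assumption)."""
    trace = []
    for _ in range(k):
        trace += [C + 1, C + 1]              # u_i and x_i access m_{C+1}
        trace += list(range(1, C + 1))       # Y_i accesses m_1, ..., m_C
        trace += list(range(C, 0, -1))       # Z_i accesses m_C, ..., m_1
    return trace

def p1_solo_trace(C, k):
    """Blocks touched by p1's solo run u_i, x_i, Y_i for i = 1..k."""
    trace = []
    for _ in range(k):
        trace += [C + 1, C + 1]
        trace += list(range(1, C + 1))
    return trace

C, k = 8, 100
seq_misses = lru_misses(sequential_trace(C, k), C)
p1_misses = lru_misses(p1_solo_trace(C, k), C)
print(seq_misses, p1_misses)   # 2k + C - 1 versus k(C + 1)
```

Under these assumptions the sequential trace misses twice per round after the first (at u_i and at the last node of Y_i), while p1 cycles through C + 1 distinct blocks with only C lines and misses on every distinct access.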
5.2 Parent Thread First at Each Fork
In this section, we show that if the parent thread is always exe-
cuted first at a fork, a work-stealing parallel execution of a struc-
tured single-touch computation can incur Ω(tT∞) deviations and
Ω(CtT∞) additional cache misses, where t is the number of
touches in the computation, while the corresponding sequential
execution incurs only a small number of cache misses. This bound
matches the upper bound for general, unstructured future-parallel
computations [22] (see footnote 2). This result, combined with the result in
Section 5.1, shows that choosing the future threads at forks to execute
first achieves better cache locality for work-stealing schedulers on
structured single-touch computations.
Footnote 2: The bound on the expected number of deviations in [22] is actually
O(PT∞ + tT∞). However, as pointed out in [22], a simple fork-join
THEOREM 10. If, at each fork, the parent thread is chosen to
execute first, then a parallel execution with work stealing can
incur Ω(tT∞) deviations and Ω(CtT∞) additional cache misses
on a structured single-touch computation, while the sequential
execution of this computation incurs only O(C + t) cache misses.
Proof. The final DAG we want to construct is in Figure 8. It uses
the DAGs in Figure 7 as building blocks. We first describe how
a single deviation at a touch u3 can incur Ω(T∞) deviations and
Ω(CT∞) additional cache misses in Figure 7(a). In order to get the
bound we want to prove, here we follow the convention in [1, 22]
to distinguish between touches and join nodes in the DAG. More
specifically, yi is a join node, not a touch, for each 1 ≤ i ≤ n.
For each 1 ≤ i ≤ n, node xi accesses memory block m1 and
yi accesses memory block mC+1. Zi consists of a chain of C
nodes zi1, zi2, ..., ziC , accessing memory blocks m1, m2, ..., mC
respectively. All the other nodes do not access memory. Assume in
the sequential execution a single processor p1 executes the entire
DAG in Figure 7(a). Suppose initially the left (future) parent of
u3 has already been executed. p1 starts executing the DAG at u1.
Since p1 always stays on the parent thread at a fork, it first pushes
s into its deque, continues to execute u2, u3, u4, and then executes
x1, x2, ..., xn while pushing z11, z21, ..., zn1 into its deque. Since
v cannot be executed due to s, p1 pops zn1 out of its deque
and executes the nodes in Zn. Then p1 executes all the nodes
in Zn−1, Zn−2, ..., Z1, in this order. So far p1 has only incurred
C cache misses, since all the nodes it has executed only access
memory blocks m1, ..., mC and hence it did not need to swap any
memory blocks out of its cache. Now p1 executes s, v and then
yn, yn−1, ..., y1, incurring only one more cache miss by replacing
m1 with mC+1 at yn. Hence, this execution incurs O(C) cache
misses in total. Note that the left parent of yi is executed before the
right parent of yi for all i.
Now assume in another execution by p1, the left parent of
u3 is in p1’s deque when p1 starts executing u1. Thus, u3 is
a deviation with respect to the previous execution. Since u3 is
not ready to execute after p1 executes u2, p1 pops s out of its
deque to execute. Since v is not ready, p1 now pops the left parent
of u3 to execute and then executes u3, u4, x1, x2, ..., xn, v. Now
p1 pops zn1 out and executes all the nodes in Zn. Note that yn is
now ready to execute and the memory blocks in p1’s cache at the
moment are m1, m2, ..., mC. Now p1 executes yn, replacing the
least recently used block m1 with mC+1. p1 then pops z(n−1)1 out
and executes all the nodes z(n−1)1, z(n−1)2, ..., z(n−1)C in Zn−1
one by one. When p1 executes z(n−1)1, it replaces m2 with m1,
and when it executes z(n−1)2, it replaces m3 with m2, and so on.
The same thing happens to all Zi and yi. Thus, p1 will incur a
cache miss at every node afterwards, ending up with Ω(Cn) cache
misses in total. Note that the computation span of this DAG is
T∞ = Θ(C + n). Thus, this execution with a deviation at u3
incurs Ω(CT∞) cache misses when n = Ω(C). Moreover, all yi
are deviations and hence this execution incurs Ω(T∞) deviations.
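The two executions of the Figure 7(a) DAG can be compared with a small LRU simulation. This is a sketch under stated assumptions: a fully associative LRU cache with C lines, only the accesses named in the proof (x_i to m_1, Z_i to m_1, ..., m_C, y_i to m_{C+1}), and hypothetical function names.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count the misses of a fully associative LRU cache on a block trace."""
    cache = OrderedDict()            # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict the least recently used block
            cache[block] = True
    return misses

def fig7a_sequential_trace(C, n):
    """x_1..x_n, then Z_n, ..., Z_1, then y_n, ..., y_1."""
    trace = [1] * n                          # every x_i accesses m_1
    for _ in range(n):
        trace += list(range(1, C + 1))       # each Z_i accesses m_1, ..., m_C
    trace += [C + 1] * n                     # every y_i accesses m_{C+1}
    return trace

def fig7a_deviating_trace(C, n):
    """x_1..x_n, then Z_n, y_n, Z_{n-1}, y_{n-1}, ..., Z_1, y_1."""
    trace = [1] * n
    for _ in range(n):
        trace += list(range(1, C + 1)) + [C + 1]   # Z_i followed at once by y_i
    return trace

C, n = 8, 100
seq_misses = lru_misses(fig7a_sequential_trace(C, n), C)
dev_misses = lru_misses(fig7a_deviating_trace(C, n), C)
print(seq_misses, dev_misses)   # C + 1 versus n(C + 1)
```

Interleaving each y_i with the preceding Z_i makes the trace cycle through C + 1 blocks, which is the worst case for an LRU cache of C lines. That interleaving is what the single deviation at u3 causes.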
Now let us see how a single steal at the beginning of a thread re-
sults in Ω(T∞) deviations and Ω(CT∞) cache misses at the end of
the thread. Figure 7(b) presents such a computation. First we con-
sider the sequential execution by a processor p1. It is easy to see p1
executes nodes in the order r, u1, w1, s2, s1, v1, u2, w2, v2, u3, w3,
s4, s3, v3, u4, .... The key observation is that wi is executed before
si is executed for any odd-numbered i while wi is executed after si
(Footnote 2, continued:) computation can get Ω(PT∞) deviations. Hence we focus on the more
interesting part Ω(tT∞).
Figure 7. DAGs (a) and (b), used by Figure 8 as building blocks.
is executed for any even-numbered i. This statement can be proved
by induction. Obviously, this holds for i = 1 and i = 2, as we
showed before. Now suppose this fact holds for all 1, 2, ..., i, for
some even-numbered i. Now suppose p1 executes ui−1. Then p1
pushes si into its deque and executes wi−1. Since we know wi−1
should be executed before si−1, si−1 has not been executed yet.
Moreover, si−1 must already be in the deque before si was pushed
into the deque, since si−1’s parent ui−2 has been executed and
si−1 is ready to execute. Now p1 pops si out to execute. Since
vi is not ready to execute, p1 pops si−1 out and then executes
si−1, vi−1, ui, and pushes si+1 into the deque. Now p1 continues
to execute wi, vi, ui+1 and pushes si+2 into its deque. Then p1
executes wi+1 and pops si+2 out, since vi+1 is not ready due to si+1.
Now we can see wi+1 and si+2 have been executed, but si+1 and
wi+2 not yet. That is, wi+1 is executed before si+1 and wi+2 is
executed after si+2. Therefore, the statement holds for i + 1 and
i + 2, which completes the proof.
The subgraph rooted at uk is identical to the graph in Fig-
ure 7(a), with vk corresponding to u3 in Figure 7(a). Therefore,
if k is an even number, vk’s left parent has been executed when
wk is executed and hence the sequential execution will incur only
O(C) cache misses on the subgraph rooted at uk.
Now consider the following parallel execution of the DAG in
Figure 7(b) by two processors p1 and p2. p1 first executes r and
pushes s1 into its deque. Then p2 immediately steals s1 and ex-
ecutes it. Now p2 falls asleep, leaving p1 executing the rest of the
DAG alone. It is easy to see p1 will execute the nodes in the DAG in
the order u1, w1, v1, u2, w2, s3, s2, v2, u3, w3, v3, u4, s4, ... It can
be proved by induction that wi is executed after si is executed for
any odd-numbered i while wi is executed before si is executed for
any even-numbered i, which is opposite to the order in the sequen-
tial execution. The induction proof is similar to that of the previous
observation in the sequential execution, so we omit the proof here.
If k is an even number, wk will be executed before the left parent
of vk and hence this execution will incur Ω(T∞) deviations and
Ω(CT∞) cache misses when n = Ω(C) and n = Ω(k).
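The two execution orders above can be reproduced with a small deque simulation. The DAG below is a reconstruction of Figure 7(b) from the textual description only (the figure itself is not available here), truncated at u4. All function names are hypothetical, and the thief is modeled simply as executing the stolen node at once and then sleeping.

```python
def build_dag(k):
    """Truncated Figure 7(b)-style DAG with touches v1..vk (a reconstruction).

    At a fork, the first child continues the parent thread and the second
    child is the first node of the future thread.
    """
    children = {"r": ["u1", "s1"]}
    for i in range(1, k + 1):
        children[f"u{i}"] = [f"w{i}", f"s{i + 1}"]   # fork
        children[f"w{i}"] = [f"v{i}"]                # local parent of touch v_i
        children[f"s{i}"] = [f"v{i}"]                # future parent of touch v_i
        children[f"v{i}"] = [f"u{i + 1}"]
    children[f"s{k + 1}"] = []    # truncation: the last future is never touched
    children[f"u{k + 1}"] = []    # truncation: last node of the parent thread
    return children

def run_parent_first(children, stolen=()):
    """Order in which one processor executes nodes, parent thread first.

    A node in `stolen` is taken by a thief as soon as it is pushed and is
    executed by the thief immediately (modeling assumption).
    """
    pending = {node: 0 for node in children}     # unexecuted parents per node
    for kids in children.values():
        for child in kids:
            pending[child] += 1
    deque, order, node = [], [], "r"
    while node is not None:
        order.append(node)
        kids = children[node]
        for child in kids:
            pending[child] -= 1
        if len(kids) == 2:                       # fork: stay on the parent thread
            continuation, future = kids
            if future in stolen:
                for child in children[future]:   # the thief runs the future node
                    pending[child] -= 1
            else:
                deque.append(future)             # push at the bottom of the deque
            node = continuation
        elif kids and pending[kids[0]] == 0:
            node = kids[0]                       # the child is ready: keep going
        else:
            node = deque.pop() if deque else None    # pop from the bottom
    return order

sequential_order = run_parent_first(build_dag(3))
parallel_order = run_parent_first(build_dag(3), stolen={"s1"})
print(" ".join(sequential_order))
print(" ".join(parallel_order[1:]))   # p1's nodes after r
```

With k = 3 the first line printed matches the sequential order listed in the text up to u4, and the second matches p1's order in the parallel execution. Stealing the single node s1 is enough to flip the relative order of every later pair wi, si.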
The final DAG we want to construct is in Figure 8. This is
actually a generalization of the DAG in Figure 7(b). Instead of
having one fork ui before each touch vi, it has two forks ui and
xi, for each i. After each touch vi, the thread at yi splits into
two identical branches, touching the futures spawned at ui and xi,
respectively. In this figure, we only depict the right branch and omit
the identical left branch. As we can see, the right branch later has
a touch vi+1 touching the future si+1 spawned at the fork xi. If
we only look at the thread on the right-hand side, it is essentially
the same as the DAG in Figure 7(b). The sequential execution of
this DAG by p1 is similar to that in Figure 7(b). The only difference
is that p1 at each yi will execute the right branch first and then
the left branch recursively. Similarly, it can be proved by induction
that wi is executed before si is executed for any odd-numbered i
while wi is executed after si is executed for any even-numbered
i. Obviously this also holds for each left branch. Now consider a
parallel execution by two processors p1 and p2. p1 first executes
r. p2 immediately steals s1 and executes it and then sleeps forever.
Now p1 makes a solo run to execute the rest of the DAG. Again, we
can prove by the same induction argument that wi is executed after
si is executed for any odd-numbered i while wi is executed before
si is executed for any even-numbered i, which is opposite to the
order in the sequential execution. The above two induction proofs
are a little more complicated than those for the DAG in Figure 7(b),
but the ideas are essentially the same (the only difference is now
we have to prove the statements hold for the two identical branches
split at fork yi at the inductive step) and hence we omit the proofs
again.
Figure 8. A DAG on which work stealing can incur Ω(tT∞)
deviations and Ω(CtT∞) additional cache misses if it chooses parent threads to execute
first at forks. This example uses the DAGs in Figure 7 as building
blocks.
By splitting each thread into two after each yi, the number of
branches in the DAG increases exponentially. Suppose there are t
touches in the DAG. Thus, there are eventually Θ(t) branches and
the height of this structure is Θ(log t). At the end of each branch
is a subgraph identical to the DAG in Figure 7(a). Therefore, the
parallel execution with only one steal can end up incurring Θ(tn)
deviations and Θ(Ctn) cache misses. The sequential execution
incurs only Θ(C + t) cache misses, since the sequential execution
will incur only 2 cache misses by swapping mC+1 in and out at
each branch, after it incurs C cache misses to load m1, m2, ..., mC
at the first branch. Hence, when n = Ω(log t) and n = Ω(C), we
get the bound stated in the theorem. ⊓⊔
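The final substitution can be spelled out as follows. The expression for the span is an assumption made for this sketch: it adds the Θ(log t) height of the branching structure to the Θ(C + n) span of one Figure 7(a) subgraph, which the text does not state explicitly.

```latex
\begin{align*}
T_\infty &= \Theta(\log t + C + n) = \Theta(n) && \text{when } n = \Omega(\log t),\ n = \Omega(C) \\
\text{deviations} &= \Theta(tn) = \Theta(tT_\infty) \\
\text{cache misses (parallel)} &= \Theta(Ctn) = \Theta(CtT_\infty) \\
\text{cache misses (sequential)} &= \Theta(C + t)
\end{align*}
```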
6. Other Kinds of Structured Computations
It is natural to ask whether other kinds of structured computations
can also achieve relatively good cache locality. We now consider
two alternative kinds of restrictions.
6.1 Structured Local-Touch Computations
In this section, we prove that work-stealing parallel executions
of structured local-touch computations also have relatively good
cache locality, if the future thread is chosen to execute first at
each fork. This result, combined with Theorems 8 and 10, implies
that work-stealing schedulers for structured computations are likely
better off choosing future threads to execute first at forks.
LEMMA 11. In the sequential execution of a structured local-touch
computation where the future thread at a fork is always chosen to
execute first, any touch x’s future parent is executed before x’s local
parent, and the right child of any fork v immediately follows the last
node of the future thread spawned at v, i.e., the future parent of the
last touch of the future thread.
The proof is omitted because it is almost identical to that of
Lemma 4. (We first consider a future thread whose touches are the
“earliest” in the DAG, that is, no other touches are ancestors of
them, and we can easily prove the statement in Lemma 11 holds
for those touches. Then by the same induction proof as for Lemma
4, we can prove the statement holds for all future threads’ touches.)
THEOREM 12. If the future thread at a fork is always chosen to
execute first, then a parallel execution with work stealing incurs
$O(PT_\infty^2)$ deviations and $O(CPT_\infty^2)$ additional cache misses in
expectation on a structured local-touch computation.
Proof. Let v be a fork that spawns a future thread t. Now we con-
sider a parallel execution. Let p be a processor that executes v and
pushes the right child of v into its deque. Suppose the right child
of v is not stolen. Now consider the subgraph G′ consisting of t
and its descendant threads. Note that G′ itself is a structured com-
putation DAG with local touch constraint. Now p starts executing
G′.
According to the local touch constraint, the only nodes outside G′
that connect to the nodes in G′ are v and the touches of t, and v
is the only node outside G′ that the nodes in G′ depend on. Now
v has been executed and the touches of t are not ready to execute
due to the right child of v. Hence, p is able to make a sequential
execution on G′ without waiting for any node outside to be done
or jumping to a node outside, as long as no one steals a node in
G′ from p’s deque. Since we assume the right child of v will not
be stolen and any nodes in G′ can only be pushed into p’s deque
below v, no nodes in G′ can be stolen. Hence, G′ will be executed
by a sequential execution by p. Therefore, there are no deviations
in G′. After p executes the last node in G′, which is the last node
in t, p pops the right child of v to execute. Hence, the right child
of v cannot be a deviation either, if it is not stolen. That is, those
nodes can be deviations only if the right child of v is stolen. Since
there are in expectation O(PT∞) steals in a parallel execution and
each future thread has at most T∞ touches, the expected number of
deviations is bounded by $O(PT_\infty^2)$ and the expected number of
additional cache misses is bounded by $O(CPT_\infty^2)$. ⊓⊔
6.2 Structured Computations with Super Final Nodes
As discussed in Section 4, in languages such as Java, the program’s
main thread typically releases all resources at the end of an execu-
tion. To model this structure, we add an edge from the last node
of each thread to the final node of the computation DAG. Thus,
the final node becomes the only node with in-degree greater than
2. Since the final node is always the last to execute, simply adding
those edges pointing to the final node into a DAG will not change
the execution order of the nodes in the DAG. It is easy to see that
having such a super node will not change the upper bound on the
cache overheads of the work-stealing parallel executions of a struc-
tured computation.
For structured computations with super final nodes, it also
makes sense to slightly relax the single-touch constraint as follows.
DEFINITION 13. A structured single-touch computation with a su-
per final node is one where each future thread t at a fork v has at
least one and at most two touches, a descendant of v’s right child
and the super final node.
In such a computation, a future thread can have the super final
node as its only touch. This structure corresponds to a program
where one thread forks another thread to accomplish a side-effect
instead of computing a value. The parent thread never touches the
resulting future, but the computation as a whole cannot terminate
until the forked thread completes its work.
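The program shape described here can be sketched with Python's standard `concurrent.futures` module (the paper mentions Java; Python is used only for illustration). In this sketch the function names are hypothetical, the `logger` future is created for its side effect and never touched, and leaving the `with` block waits for all submitted tasks, which plays the role of the super final node.

```python
from concurrent.futures import ThreadPoolExecutor

log = []

def record(message):
    """Side-effect-only task: its future is never touched by the parent."""
    log.append(message)

def square(x):
    """Value-producing task: its future is touched exactly once."""
    return x * x

def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        logger = pool.submit(record, "started")   # future with no ordinary touch
        value = pool.submit(square, 7)
        result = value.result()                   # the single touch, by the creator
    # Leaving the with-block calls shutdown(wait=True), so every task has
    # completed here, including the one whose future was never touched.
    return result, logger.done()

result, logger_done = main()
print(result, logger_done, log)
```

The only dependency from the `record` task to the rest of the program is the wait at the end of the `with` block, which matches a future thread whose only touch is the super final node.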
Now we show that the parallel executions of structured single-
touch computations with super final nodes also have relatively low
cache overheads.
LEMMA 14. In the sequential execution of a structured single-
touch computation with a super final node, where the future thread
at a fork is always chosen to execute first, any touch x’s future
parent is executed before x’s local parent, and the right child u of
any fork v immediately follows the last node of the future thread
spawned at v, i.e., the future parent of the last touch of the future
thread.
LEMMA 15. Let t be the future thread at a fork v in a structured
single-touch computation with a super final node. If a touch of t or
v’s right child u is a deviation, then either u is stolen or there is a
touch by t which is a deviation.
Proof. The proofs of Lemma 4 and Lemma 7, with only minor
modifications, also apply to the above two lemmas, respectively.
That is because introducing the super final node into a computation
doesn’t affect the order in which other nodes are executed, since no
other nodes need to wait for the super final node and the super final
node is always the last node to execute. More specifically, when
a processor executing any thread t reaches a node that is a parent
of the super final node, the processor will continue to work on t
if that node is not the last node of t, and otherwise try popping a
node out of its deque. Therefore, by the same proof techniques as
for Lemmas 4 and 7, we can show that a processor will execute the
right child u of a fork v and the parents of the touches of the future
spawned at v in the order stated in Lemmas 14 and 15. ⊓⊔
THEOREM 16. If, at each fork, the future thread is chosen to ex-
ecute first, then a parallel execution with work stealing incurs
$O(PT_\infty^2)$ deviations and $O(CPT_\infty^2)$ additional cache misses in
expectation on a structured single-touch computation with a super
final node.
Proof. The proof is similar to that of Theorem 8. The only dif-
ference is that if a touch by a thread t is a deviation, now the
two touches of t can both be deviations, which could be a trou-
ble for constructing the deviation chains. Fortunately, one of these
two touches is the super final node, which is always the last node
to execute and hence will not make the touches of other threads
become deviations. Therefore, we can still get a unique deviation
chain starting from a steal and hence the proof of Theorem 8 still
applies here. ⊓⊔
Similarly, we can also introduce a super final node to a struc-
tured local-touch computation as follows.
DEFINITION 17. A structured local-touch computation with a su-
per final node is one where each future thread t spawned at a fork
v can be touched only by the super final node and by t’s parent
thread at nodes that are descendants of the right child of v.
It is obvious that by the same proof as for Theorem 12, we can
prove the following bounds.
THEOREM 18. If the future thread at a fork is always chosen to
execute first, then a parallel execution with work stealing incurs
$O(PT_\infty^2)$ deviations and $O(CPT_\infty^2)$ additional cache misses in
expectation on a structured local-touch computation with a super
final node.
7. Conclusions
We have focused primarily on structured single-touch computa-
tions, in which futures are used in a restricted way. We saw that for
such computations, a parallel execution by a work-stealing scheduler
that runs future threads first can incur at most $O(CPT_\infty^2)$
cache misses more than the corresponding sequential execution,
substantially better cache locality than the Ω(CPT∞ + CtT∞)
worst-case additional cache misses possible with unstructured use
of futures. Although we cannot prove this claim formally, we think
that these restrictions correspond to program structures that would
occur naturally anyway in many (but not all) parallel programs that
use futures. For example, Cilk [9] programs are structured single-touch
computations, and Blelloch and Reid-Miller [7] observe that the
single-touch requirement substantially simplifies implementations.
We also considered some alternative restrictions on future use,
such as structured local-touch computations, and structured com-
putations with super final nodes, that also incur a relatively low
cache-locality penalty. In terms of future work, we think it would
be promising to investigate how far these restrictions can be weak-
ened or modified while still avoiding a high cache-locality penalty.
We would also like to understand how these observations can be
exploited by future compilers and run-time systems.
References
[1] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data
locality of work stealing. In Proceedings of the twelfth annual ACM
symposium on Parallel algorithms and architectures, SPAA ’00, pages
1–12, New York, NY, USA, 2000. ACM.
[2] Kunal Agrawal, Yuxiong He, and Charles E. Leiserson. Adaptive work
stealing with parallelism feedback. In Proceedings of the 12th ACM SIG-
PLAN symposium on Principles and practice of parallel programming,
PPoPP ’07, pages 112–120, New York, NY, USA, 2007. ACM.
[3] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread
scheduling for multiprogrammed multiprocessors. In Proceedings of the
tenth annual ACM symposium on Parallel algorithms and architectures,
SPAA ’98, pages 119–129, New York, NY, USA, 1998. ACM.
[4] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: data
structures for parallel computing. ACM Trans. Program. Lang. Syst.,
11(4):598–632, October 1989.
[5] Guy E. Blelloch. Programming parallel algorithms. Commun. ACM,
39(3):85–97, March 1996.
[6] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably ef-
ficient scheduling for languages with fine-grained parallelism. In Pro-
ceedings of the seventh annual ACM symposium on Parallel algorithms
and architectures, SPAA ’95, pages 1–12, New York, NY, USA, 1995.
ACM.
[7] Guy E. Blelloch and Margaret Reid-Miller. Pipelining with futures. In
Proceedings of the ninth annual ACM symposium on Parallel algorithms
and architectures, SPAA ’97, pages 249–259, New York, NY, USA,
1997. ACM.
[8] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E.
Leiserson, and Keith H. Randall. An analysis of dag-consistent dis-
tributed shared-memory algorithms. In Proceedings of the eighth annual
ACM symposium on Parallel algorithms and architectures, SPAA ’96,
pages 297–308, New York, NY, USA, 1996. ACM.
[9] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul,
Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient
multithreaded runtime system. In Proceedings of the fifth ACM SIGPLAN
symposium on Principles and practice of parallel programming, PPOPP
’95, pages 207–216, New York, NY, USA, 1995. ACM.
[10] Robert D. Blumofe and Charles E. Leiserson. Space-efficient schedul-
ing of multithreaded computations. SIAM J. Comput., 27(1):202–229,
February 1998.
[11] Robert D. Blumofe and Charles E. Leiserson. Scheduling mul-
tithreaded computations by work stealing. J. ACM, 46(5):720–748,
September 1999.
[12] F. Warren Burton and M. Ronan Sleep. Executing functional programs
on a virtual tree of processors. In Proceedings of the 1981 conference on
Functional programming languages and computer architecture, FPCA
’81, pages 187–194, New York, NY, USA, 1981. ACM.
[13] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In
Proceedings of the seventeenth annual ACM symposium on Parallelism
in algorithms and architectures, SPAA ’05, pages 21–28, New York, NY,
USA, 2005. ACM.
[14] Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw.
Implicitly-threaded parallelism in manticore. In Proceedings of the 13th
ACM SIGPLAN international conference on Functional programming,
ICFP ’08, pages 119–130, New York, NY, USA, 2008. ACM.
[15] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The imple-
mentation of the cilk-5 multithreaded language. In Proceedings of the
ACM SIGPLAN 1998 conference on Programming language design and
implementation, PLDI ’98, pages 212–223, New York, NY, USA, 1998.
ACM.
[16] John Giacomoni, Tipp Moseley, and Manish Vachharajani. Fastfor-
ward for efficient pipeline parallelism: a cache-optimized concurrent
lock-free queue. In Proceedings of the 13th ACM SIGPLAN Symposium
on Principles and practice of parallel programming, PPoPP ’08, pages
43–52, New York, NY, USA, 2008. ACM.
[17] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploit-
ing coarse-grained task, data, and pipeline parallelism in stream pro-
grams. In Proceedings of the 12th international conference on Archi-
tectural support for programming languages and operating systems, AS-
PLOS XII, pages 151–162, New York, NY, USA, 2006. ACM.
[18] Robert H. Halstead, Jr. Implementation of multilisp: Lisp on a mul-
tiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and
functional programming, LFP ’84, pages 9–17, New York, NY, USA,
1984. ACM.
[19] Robert H. Halstead, Jr. Multilisp: a language for concurrent symbolic
computation. ACM Trans. Program. Lang. Syst., 7(4):501–538, October
1985.
[20] D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-t: a high-
performance parallel lisp. In Proceedings of the ACM SIGPLAN
1989 Conference on Programming language design and implementation,
PLDI ’89, pages 81–90, New York, NY, USA, 1989. ACM.
[21] I-Ting Angelina Lee, Charles E. Leiserson, Tao B. Schardl, Jim Sukha,
and Zhunping Zhang. On-the-fly pipeline parallelism. In Proceedings of
the 25th ACM symposium on Parallelism in algorithms and architectures,
SPAA ’13, pages 140–151, New York, NY, USA, 2013. ACM.
[22] Daniel Spoonhower, Guy E. Blelloch, Phillip B. Gibbons, and Robert
Harper. Beyond nested parallelism: tight bounds on work-stealing over-
heads for parallel futures. In Proceedings of the twenty-first annual sym-
posium on Parallelism in algorithms and architectures, SPAA ’09, pages
91–100, New York, NY, USA, 2009. ACM.
