61 research outputs found
Configurable Strategies for Work-stealing
Work-stealing systems are typically oblivious to the nature of the tasks they
are scheduling. For instance, they do not know or take into account how long a
task will take to execute or how many subtasks it will spawn. Moreover, the
actual task execution order is typically determined by the underlying task
storage data structure, and cannot be changed. There are thus possibilities for
optimizing task parallel executions by providing information on specific tasks
and their preferred execution order to the scheduling system.
We introduce scheduling strategies to enable applications to dynamically
provide hints to the task-scheduling system on the nature of specific tasks.
Scheduling strategies can be used to independently control both local task
execution order as well as steal order. In contrast to conventional scheduling
policies that are normally global in scope, strategies allow the scheduler to
apply optimizations on individual tasks. This flexibility greatly improves
composability as it allows the scheduler to apply different, specific
scheduling choices for different parts of applications simultaneously. We
present a number of benchmarks that highlight diverse, beneficial effects that
can be achieved with scheduling strategies. Some benchmarks (branch-and-bound,
single-source shortest path) show that prioritization of tasks can reduce the
total amount of work compared to standard work-stealing execution order. For
other benchmarks (triangle strip generation) qualitatively better results can
be achieved in shorter time. Other optimizations, such as dynamic merging of
tasks or stealing of half the work, instead of half the tasks, are also shown
to improve performance. Composability is demonstrated by examples that combine
different strategies, both within the same kernel (prefix sum) as well as when
scheduling multiple kernels (prefix sum and unbalanced tree search)
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
Well-Structured Futures and Cache Locality
In fork-join parallelism, a sequential program is split into a directed
acyclic graph of tasks linked by directed dependency edges, and the tasks are
executed, possibly in parallel, in an order consistent with their dependencies.
A popular and effective way to extend fork-join parallelism is to allow threads
to create futures. A thread creates a future to hold the results of a
computation, which may or may not be executed in parallel. That result is
returned when some thread touches that future, blocking if necessary until the
result is ready.
Recent research has shown that while futures can, of course, enhance
parallelism in a structured way, they can have a deleterious effect on cache
locality. In the worst case, futures can incur deviations, which implies
additional cache misses, where is the number of cache lines, is the
number of processors, is the number of touches, and is the
\emph{computation span}. Since cache locality has a large impact on software
performance on modern multicores, this result is troubling.
In this paper, however, we show that if futures are used in a simple,
disciplined way, then the situation is much better: if each future is touched
only once, either by the thread that created it, or by a thread to which the
future has been passed from the thread that created it, then parallel
executions with work stealing can incur at most additional
cache misses, a substantial improvement. This structured use of futures is
characteristic of many (but not all) parallel applications
Multigrain Affinity for Heterogeneous Work Stealing
International audienceIn a parallel computing context, peak performance is hard to reach with irregular applications such as sparse linear algebra operations. It requires dynamic adjustments to automatically balance the workload between several processors. The problem becomes even more complicated when an architecture contains processing units with radically different computing capabilities. We present a hierarchical scheduling scheme designed to harness several CPUs and a GPU. It is built on a two-level work stealing mechanism tightly coupled to a software-managed cache. We show that our approach is well suited to dynamically control heterogeneous architectures, while taking advantage of a reduction of data transfers
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. ""
(parallel) and "" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e, Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is , where is the cache complexity of task ,
is the cost of cache miss at level- cache which is of size ,
is a constant, and is the number of processors in an
-level cache hierarchy
Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers
We analyze the caching overhead incurred by a class of multithreaded
algorithms when scheduled by an arbitrary scheduler. We obtain bounds that
match or improve upon the well-known caching cost for the
randomized work stealing (RWS) scheduler, where is the number of steals,
is the sequential caching cost, and and are the cache size and
block (or cache line) size respectively.Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and
Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates
including a missing citation and the replacement of some big Oh terms with
precise constant
- …