Search CORE

61 research outputs found

Configurable Strategies for Work-stealing

Author: Cederman Daniel
Träff Jesper Larsson
Tsigas Philippas
Wimmer Martin
Publication venue
Publication date: 01/01/2013
Field of study

Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order as well as steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation) qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search)

arXiv.org e-Print Archive

Chalmers Research

Chalmers Publication Library

A Fast Causal Profiler for Task Parallel Programs

Author: Nagarakatte Santosh
Yoga Adarsh
Publication venue
Publication date: 02/07/2017
Field of study

This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel using multi-cores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to in- crease parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.Comment: 11 page

arXiv.org e-Print Archive

Crossref

Well-Structured Futures and Cache Locality

Author: Herlihy Maurice
Liu Zhiyu
Publication venue
Publication date: 16/08/2016
Field of study

In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that while futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur

\Omega(P T_\infty + t T_\infty)

deviations, which implies

\Omega(C P T_\infty + C t T_\infty)

additional cache misses, where

C

is the number of cache lines,

P

is the number of processors,

t

is the number of touches, and

T_\infty

is the \emph{computation span}. Since cache locality has a large impact on software performance on modern multicores, this result is troubling. In this paper, however, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it, or by a thread to which the future has been passed from the thread that created it, then parallel executions with work stealing can incur at most

O(C P T^2_\infty)

additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications

arXiv.org e-Print Archive

CiteSeerX

Multigrain Affinity for Heterogeneous Work Stealing

Author: Carribault Patrick
Cohen Albert
Vet Jean-Yves
Publication venue: HAL CCSD
Publication date: 23/01/2012
Field of study

International audienceIn a parallel computing context, peak performance is hard to reach with irregular applications such as sparse linear algebra operations. It requires dynamic adjustments to automatically balance the workload between several processors. The problem becomes even more complicated when an architecture contains processing units with radically different computing capabilities. We present a hierarchical scheduling scheme designed to harness several CPUs and a GPU. It is built on a two-level work stealing mechanism tightly coupled to a software-managed cache. We show that our approach is well suited to dynamically control heterogeneous architectures, while taking advantage of a reduction of data transfers

INRIA a CCSD electronic archive server

HAL-CEA

Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Author: Dinh David
Simhadri Harsha Vardhan
Tang Yuan
Publication venue
Publication date: 14/02/2016
Field of study

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "

\parallel

" (parallel) and "

;

" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "

\leadsto

" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is

O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right)

, where

Q^{*}

is the cache complexity of task

{\mathsf t}

C_i

is the cost of cache miss at level-

i

cache which is of size

M_i

\sigma\in(0,1)

is a constant, and

p

is the number of processors in an

h

-level cache hierarchy

arXiv.org e-Print Archive

Crossref

Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

Author: Cole Richard
Ramachandran Vijaya
Publication venue
Publication date: 28/09/2017
Field of study

We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known

O(Q+S \cdot (M/B))

caching cost for the randomized work stealing (RWS) scheduler, where

S

is the number of steals,

Q

is the sequential caching cost, and

M

and

B

are the cache size and block (or cache line) size respectively.Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates including a missing citation and the replacement of some big Oh terms with precise constant

arXiv.org e-Print Archive

Crossref