623 research outputs found
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages
Processor design has turned toward parallelism and heterogeneity
cores to achieve performance and energy efficiency. Developers
find high-level languages attractive because they use abstraction
to offer productivity and portability over hardware complexities.
To achieve performance, some modern implementations of high-level
languages use work-stealing scheduling for load balancing of
dynamically created tasks. Work-stealing is a promising approach
for effectively exploiting software parallelism on parallel
hardware. A programmer who uses work-stealing explicitly
identifies potential parallelism and the runtime then schedules
work, keeping otherwise idle hardware busy while relieving
overloaded hardware of its burden.
However, work-stealing comes with substantial overheads. These
overheads arise as a necessary side effect of the implementation
and hamper parallel performance. In addition to runtime-imposed
overheads, there is a substantial cognitive load associated with
ensuring that parallel code is data-race free. This dissertation
explores the overheads associated with achieving high performance
parallelism in modern high-level languages.
My thesis is that, by exploiting existing underlying mechanisms
of managed runtimes; and by extending existing language design,
high-level languages will be able to deliver productivity and
parallel performance at the levels necessary for widespread
uptake.
The key contributions of my thesis are: 1) a detailed analysis of
the key sources of overhead associated with a work-stealing
runtime, namely sequential and dynamic overheads; 2) novel
techniques to reduce these overheads that use rich features of
managed runtimes such as the yieldpoint mechanism, on-stack
replacement, dynamic code-patching, exception handling support,
and return barriers; 3) comprehensive analysis of the resulting
benefits, which demonstrate that work-stealing overheads can be
significantly reduced, leading to substantial performance
improvements; and 4) a small set of language extensions that
achieve both high performance and high productivity with minimal
programmer effort.
A managed runtime forms the backbone of any modern implementation
of a high-level language. Managed runtimes enjoy the benefits of
a long history of research and their implementations are highly
optimized. My thesis demonstrates that converging these highly
optimized features together with the expressiveness of high-level
languages, gives further hope for achieving high performance and
high productivity on modern parallel hardwar
A taxonomy of task-based parallel programming technologies for high-performance computing
Task-based programming models for shared memory -- such as Cilk Plus and OpenMP 3 -- are well established and documented. However, with the increase in parallel, many-core and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists.
In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today
Configurable Strategies for Work-stealing
Work-stealing systems are typically oblivious to the nature of the tasks they
are scheduling. For instance, they do not know or take into account how long a
task will take to execute or how many subtasks it will spawn. Moreover, the
actual task execution order is typically determined by the underlying task
storage data structure, and cannot be changed. There are thus possibilities for
optimizing task parallel executions by providing information on specific tasks
and their preferred execution order to the scheduling system.
We introduce scheduling strategies to enable applications to dynamically
provide hints to the task-scheduling system on the nature of specific tasks.
Scheduling strategies can be used to independently control both local task
execution order as well as steal order. In contrast to conventional scheduling
policies that are normally global in scope, strategies allow the scheduler to
apply optimizations on individual tasks. This flexibility greatly improves
composability as it allows the scheduler to apply different, specific
scheduling choices for different parts of applications simultaneously. We
present a number of benchmarks that highlight diverse, beneficial effects that
can be achieved with scheduling strategies. Some benchmarks (branch-and-bound,
single-source shortest path) show that prioritization of tasks can reduce the
total amount of work compared to standard work-stealing execution order. For
other benchmarks (triangle strip generation) qualitatively better results can
be achieved in shorter time. Other optimizations, such as dynamic merging of
tasks or stealing of half the work, instead of half the tasks, are also shown
to improve performance. Composability is demonstrated by examples that combine
different strategies, both within the same kernel (prefix sum) as well as when
scheduling multiple kernels (prefix sum and unbalanced tree search)
Locality-Aware Dynamic Task Graph Scheduling
Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NabbitC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NabbitC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to data needed during that node’s execution. NabbitC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color—leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NabbitC tries to optimize load balance and not add too much overhead compared to the vanilla Nabbit scheduler that does not consider locality. We provide a theoretical analysis that shows that NabbitC does not asymptotically impact the scalability of Nabbit . We evaluated the performance of NabbitC on a suite of memory intensive benchmarks. Our experiments indicates that adding locality awareness has a considerable performance advantage compared to the vanilla Nabbit scheduler. In addition, we also compared NabbitC to OpenMP programs for both regular and irregular applications. For regular applications, OpenMP achieves perfect locality and perfect load balance statically. For these benchmarks, NabbitC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. For irregular applications, where OpenMP can not achieve locality and load balance simultaneously, we find that NabbitC performs better. Therefore, NabbitC combines the benefits of locality- aware scheduling for regular applications (the forte of static schedulers such as those in OpenMP) and dynamically adapting to load imbalance (the forte of dynamic schedulers such as Cilk Plus, TBB, and Nabbit)
Adaptive query parallelization in multi-core column stores
With the rise of multi-core CPU platforms, their optimal utilization
for in-memory OLAP workloads using column store databases has
become one of the biggest challenges. Some of the inherent limi-
tations in the achievable query parallelism are due to the degree of
parallelism dependency on the data skew, the overheads incurred by
thread coordination, and the hardware resource limits. Finding the
right balance between the degree of parallelism and the multi-core
utilizati
- …