2,155 research outputs found
Cooperative auto-tuning of parallel skeletons
Improving program performance through the use of multiple homogeneous processing
elements, or cores, is common-place. However, these architectures increase the
complexity required at the software level. Existing work is focused on optimising
programs that run in isolation on these systems, but ignores the fact that, in reality,
these systems run multiple parallel programs concurrently with programs competing
for system resources. In order to improve performance in this shared environment,
cooperative tuning of multiple, concurrently running parallel programs is required.
Moreover, the set of programs running on the system – the system workload – is dynamic
and rapidly changing. This makes cooperative tuning a challenge, as it must
react rapidly to changes in the system workload.
This thesis explores the scope for performance improvement from cooperatively
tuning skeleton parallel programs, and techniques that can be used to cooperatively
auto-tune parallel programs. Parallel skeletons provide a clear separation between
algorithm description and implementation, and provide tuning knobs that the system
can use to make high-level changes to a programs implementation. This work
is in three parts: (i) how many threads should be allocated to each program running
on the system, (ii) on which cores should a programs threads be executed and
(iii) what values should be chosen for high-level parameters of the parallel skeletons.
We demonstrate that significant performance improvements are available in each of
these areas, compared to the current state-of-the-art
Autonomic management of multiple non-functional concerns in behavioural skeletons
We introduce and address the problem of concurrent autonomic management of
different non-functional concerns in parallel applications build as a
hierarchical composition of behavioural skeletons. We first define the problems
arising when multiple concerns are dealt with by independent managers, then we
propose a methodology supporting coordinated management, and finally we discuss
how autonomic management of multiple concerns may be implemented in a typical
use case. The paper concludes with an outline of the challenges involved in
realizing the proposed methodology on distributed target architectures such as
clusters and grids. Being based on the behavioural skeleton concept proposed in
the CoreGRID GCM, it is anticipated that the methodology will be readily
integrated into the current reference implementation of GCM based on Java
ProActive and running on top of major grid middleware systems.Comment: 20 pages + cover pag
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
High-level programming of stencil computations on multi-GPU systems using the SkelCL library
The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertação para obtenção do Grau de Mestre em
Engenharia InformáticaWith the proliferation of general purpose GPUs, workload parallelization and datatransfer optimization became an increasing concern. The natural evolution from using a single GPU, is multiplying the amount of available processors, presenting new challenges, as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems.
Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), because of their low-level and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations
such as transparent load balancing mechanism and reduced explicit code overhead.
Algorithmic Skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by
defining computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of the abstractions to automate the
orchestration needed for an efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single-GPU.projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
SkelCL: enhancing OpenCL for high-level programming of multi-GPU systems
Application development for modern high-performance systems with Graphics Processing Units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs.
In this paper, we present SkelCL – a high-level programming approach for systems with multiple GPUs and its implementation as a library on top of OpenCL. SkelCL provides three main enhancements to the OpenCL standard: 1) computations are conveniently expressed using parallel algorithmic patterns (skeletons); 2) memory management is simplified using parallel
container data types (vectors and matrices); 3) an automatic data (re)distribution mechanism allows for implicit data movements between
GPUs and ensures scalability when using multiple GPUs. We demonstrate how SkelCL is used to implement parallel applications on one- and two-dimensional data. We report experimental results to evaluate our approach in terms of programming effort and performance
- …