301 research outputs found
Contract-Based General-Purpose GPU Programming
Using GPUs as general-purpose processors has revolutionized parallel
computing by offering, for a large and growing set of algorithms, massive
data-parallelization on desktop machines. An obstacle to widespread adoption,
however, is the difficulty of programming them and the low-level control of the
hardware required to achieve good performance. This paper suggests a
programming library, SafeGPU, that aims at striking a balance between
programmer productivity and performance, by making GPU data-parallel operations
accessible from within a classical object-oriented programming language. The
solution is integrated with the design-by-contract approach, which increases
confidence in functional program correctness by embedding executable program
specifications into the program text. We show that our library leads to modular
and maintainable code that is accessible to GPGPU non-experts, while providing
performance that is comparable with hand-written CUDA code. Furthermore,
runtime contract checking turns out to be feasible, as the contracts can be
executed on the GPU
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
Algorithmic skeleton framework for the orchestration of GPU computations
Dissertação para obtenção do Grau de Mestre em
Engenharia InformáticaThe Graphics Processing Unit (GPU) is gaining popularity as a co-processor to the
Central Processing Unit (CPU), due to its ability to surpass the latter’s performance in certain application fields. Nonetheless, harnessing the GPU’s capabilities is a non-trivial exercise that requires good knowledge of parallel programming. Thus, providing ways to extract such computational power has become an emerging research topic.
In this context, there have been several proposals in the field of GPGPU (Generalpurpose Computation on Graphics Processing Unit) development. However, most of these still offer a low-level abstraction of the GPU computing model, forcing the developer to adapt application computations in accordance with the SPMD model, as well as
to orchestrate the low-level details of the execution. On the other hand, the higher-level approaches have limitations that prevent the full exploitation of GPUs when the purpose goes beyond the simple offloading of a kernel.
To this extent, our proposal builds on the recent trend of applying the notion of algorithmic patterns (skeletons) to GPU computing. We propose Marrow, a high-level algorithmic skeleton framework that expands the set of skeletons currently available in
this field. Marrow’s skeletons orchestrate the execution of OpenCL computations and
introduce optimizations that overlap communication and computation, thus conjoining programming simplicity with performance gains in many application scenarios. Additionally, these skeletons can be combined (nested) to create more complex applications.
We evaluated the proposed constructs by confronting them against the comparable
skeleton libraries for GPGPU, as well as against hand-tuned OpenCL programs. The
results are favourable, indicating that Marrow’s skeletons are both flexible and efficient in the context of GPU computing.FCT-MCTES - financing the equipmen
Heterogeneous computing with an algorithmic skeleton framework
The Graphics Processing Unit (GPU) is present in almost every modern day personal
computer. Despite its specific purpose design, they have been increasingly used for general
computations with very good results. Hence, there is a growing effort from the community
to seamlessly integrate this kind of devices in everyday computing. However, to
fully exploit the potential of a system comprising GPUs and CPUs, these devices should
be presented to the programmer as a single platform.
The efficient combination of the power of CPU and GPU devices is highly dependent
on each device’s characteristics, resulting in platform specific applications that cannot
be ported to different systems. Also, the most efficient work balance among devices is
highly dependable on the computations to be performed and respective data sizes.
In this work, we propose a solution for heterogeneous environments based on the
abstraction level provided by algorithmic skeletons. Our goal is to take full advantage of
the power of all CPU and GPU devices present in a system, without the need for different
kernel implementations nor explicit work-distribution.To that end, we extended Marrow,
an algorithmic skeleton framework for multi-GPUs, to support CPU computations and
efficiently balance the work-load between devices. Our approach is based on an offline
training execution that identifies the ideal work balance and platform configurations for
a given application and input data size.
The evaluation of this work shows that the combination of CPU and GPU devices can
significantly boost the performance of our benchmarks in the tested environments, when
compared to GPU-only executions
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertação para obtenção do Grau de Mestre em
Engenharia InformáticaWith the proliferation of general purpose GPUs, workload parallelization and datatransfer optimization became an increasing concern. The natural evolution from using a single GPU, is multiplying the amount of available processors, presenting new challenges, as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems.
Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), because of their low-level and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations
such as transparent load balancing mechanism and reduced explicit code overhead.
Algorithmic Skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by
defining computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of the abstractions to automate the
orchestration needed for an efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single-GPU.projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
Toward optimised skeletons for heterogeneous parallel architecture with performance cost model
High performance architectures are increasingly heterogeneous with shared and
distributed memory components, and accelerators like GPUs. Programming such
architectures is complicated and performance portability is a major issue as the
architectures evolve. This thesis explores the potential for algorithmic skeletons
integrating a dynamically parametrised static cost model, to deliver portable
performance for mostly regular data parallel programs on heterogeneous archi-
tectures.
The rst contribution of this thesis is to address the challenges of program-
ming heterogeneous architectures by providing two skeleton-based programming
libraries: i.e. HWSkel for heterogeneous multicore clusters and GPU-HWSkel
that enables GPUs to be exploited as general purpose multi-processor devices.
Both libraries provide heterogeneous data parallel algorithmic skeletons including
hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll.
The second contribution is the development of cost models for workload dis-
tribution. First, we construct an architectural cost model (CM1) to optimise
overall processing time for HWSkel heterogeneous skeletons on a heterogeneous
system composed of networks of arbitrary numbers of nodes, each with an ar-
bitrary number of cores sharing arbitrary amounts of memory. The cost model
characterises the components of the architecture by the number of cores, clock
speed, and crucially the size of the L2 cache. Second, we extend the HWSkel cost
model (CM1) to account for GPU performance. The extended cost model (CM2)
is used in the GPU-HWSkel library to automatically nd a good distribution
for both a single heterogeneous multicore/GPU node, and clusters of heteroge-
neous multicore/GPU nodes. Experiments are carried out on three heterogeneous
multicore clusters, four heterogeneous multicore/GPU clusters, and three single
heterogeneous multicore/GPU nodes. The results of experimental evaluations for
four data parallel benchmarks, i.e. sumEuler, Image matching, Fibonacci, and
Matrix Multiplication, show that our combined heterogeneous skeletons and cost
models can make good use of resources in heterogeneous systems. Moreover using
cores together with a GPU in the same host can deliver good performance either
on a single node or on multiple node architectures
- …