PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
develop tools tailored to their needs. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.
Comment: Submitted to Parallel Computing, Elsevie
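As a rough illustration of the run-time code generation idea described in this abstract, the sketch below builds CUDA C source text from a template while the Python program is running. The kernel, the template, and the helper name `generate_scale_kernel` are hypothetical and not taken from the article. Only the standard library is used, so the example stops at producing the source string and does not need a GPU.

```python
from string import Template

# Hypothetical CUDA C kernel template (an assumption for illustration).
# The ${...} placeholders are filled in at run time, once the element
# type and the scale factor are known.
KERNEL_TEMPLATE = Template("""\
__global__ void ${name}(${dtype} *dest, const ${dtype} *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dest[i] = (${dtype})${factor} * src[i];
}
""")


def generate_scale_kernel(name, dtype, factor):
    """Return CUDA C source specialized for one element type and scale factor."""
    return KERNEL_TEMPLATE.substitute(
        name=name, dtype=dtype, factor=repr(float(factor))
    )


# The factor is baked into the generated source as a literal constant.
source = generate_scale_kernel("scale_by_3", "float", 3.0)
print(source)
```

On a machine with a CUDA-capable GPU and PyCUDA installed, a string like `source` could then be passed to `pycuda.compiler.SourceModule` and the kernel retrieved with `get_function("scale_by_3")`. That step is left out here so the sketch runs anywhere.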
Effective Extensible Programming: Unleashing Julia on GPUs
GPUs and other accelerators are popular devices for accelerating
compute-intensive, parallelizable applications. However, programming these
devices is a difficult task. Writing efficient device code is challenging, and
is typically done in a low-level programming language. High-level languages are
rarely supported, or do not integrate with the rest of the high-level language
ecosystem. To overcome this, we propose compiler infrastructure to efficiently
add support for new hardware or environments to an existing programming
language.
We evaluate our approach by adding support for NVIDIA GPUs to the Julia
programming language. By integrating with the existing compiler, we
significantly lower the cost to implement and maintain the new compiler, and
facilitate reuse of existing application code. Moreover, use of the high-level
Julia programming language enables new and dynamic approaches for GPU
programming. This greatly improves programmer productivity, while maintaining
application performance similar to that of the official NVIDIA CUDA toolkit.
GPU Accelerated Finite Element Assembly with Runtime Compilation
In recent years, high-performance scientific computing on graphics processing
units (GPUs) has gained widespread acceptance. These devices are designed to
offer massively parallel threads for running general-purpose code. Much
research focuses on the finite element method on GPUs. However, most of
these works are specific to certain problems and applications. Some works
propose methods for finite element assembly that are general across a wide
range of finite element models, but the development of finite element code
remains dependent on the hardware architecture, and using the libraries
provided by the hardware vendors is usually complicated and error-prone. In
this paper, we present the architecture and implementation of finite element
assembly for partial differential equations (PDEs) based on symbolic
computation and a runtime compilation technique on the GPU. A user-friendly
programming interface with symbolic computation is provided. At the same time,
high computational efficiency is achieved by using the runtime compilation
technique. As far as we know, it is the first work using this technique to
accelerate finite element assembly for solving PDEs. Experiments show that a
speedup of one to two orders of magnitude is achieved for the problems studied
in the paper.
Comment: 6 pages, 8 figures, conferenc
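The combination of a user-supplied expression and runtime compilation can be sketched on the CPU with nothing but the Python standard library. This is not the paper's system: a plain string stands in for a symbolic expression, Python's built-in `compile` stands in for a GPU compiler, and the model problem (a 1D operator -(k(x) u')' on linear elements with midpoint quadrature) and the names `make_element_kernel` and `assemble` are assumptions chosen for brevity.

```python
def make_element_kernel(coeff_expr):
    """Generate and compile a 2x2 element-stiffness routine at run time.

    coeff_expr is a Python expression in x for the coefficient k(x)
    (a stand-in for a symbolic expression; hypothetical interface).
    """
    src = (
        "def element_matrix(x0, x1):\n"
        "    h = x1 - x0\n"
        "    x = 0.5 * (x0 + x1)  # midpoint quadrature point\n"
        f"    k = {coeff_expr}\n"
        "    return [[k / h, -k / h], [-k / h, k / h]]\n"
    )
    namespace = {}
    # Runtime compilation: the generated source becomes a callable function.
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["element_matrix"]


def assemble(nodes, element_matrix):
    """Sum element matrices into a dense global matrix for a 1D mesh."""
    n = len(nodes)
    K = [[0.0] * n for _ in range(n)]
    for e in range(n - 1):
        ke = element_matrix(nodes[e], nodes[e + 1])
        for a in range(2):
            for b in range(2):
                K[e + a][e + b] += ke[a][b]
    return K


kernel = make_element_kernel("1.0 + x")
K = assemble([0.0, 0.5, 1.0], kernel)
print(K)
```

Because the coefficient expression is pasted into generated source, the compiled routine carries no interpretation overhead for the expression itself. Note that `exec` on untrusted strings is unsafe, so a real interface would build the source from a checked symbolic form.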
High-level GPU programming in Julia
GPUs are popular devices for accelerating scientific calculations. However,
as GPU code is usually written in low-level languages, it breaks the
abstractions of high-level languages popular with scientific programmers. To
overcome this, we present a framework for CUDA GPU programming in the
high-level Julia programming language. This framework compiles Julia source
code for GPU execution, and takes care of the necessary low-level interactions
using modern code generation techniques to avoid run-time overhead.
Evaluating the framework and its APIs on a case study comprising the trace
transform from the field of image processing, we find that the impact on
performance is minimal, while greatly increasing programmer productivity. The
metaprogramming capabilities of the Julia language proved invaluable for
enabling this. Our framework significantly improves usability of GPUs, making
them accessible for a wide range of programmers. It is available as free and
open-source software licensed under the MIT License.
Analytical Cost Metrics : Days of Future Past
As we move towards the exascale era, the new architectures must be capable of
running the massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
The architectures are constantly evolving, making current performance
optimization strategies less applicable and requiring new ones to be invented. The
solution is for the new architectures, new programming models, and applications
to go forward together. Doing this is, however, extremely hard. There are too
many design choices in too many dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
We present a comparison of several modern C++ libraries providing high-level
interfaces for programming multi- and many-core architectures on top of CUDA or
OpenCL. The comparison focuses on the solution of ordinary differential
equations and is based on odeint, a framework for the solution of systems of
ordinary differential equations. Odeint is designed in a very flexible way and
may be easily adapted for effective use of libraries such as Thrust, MTL4,
VexCL, or ViennaCL, using CUDA or OpenCL technologies. We found that CUDA and
OpenCL work equally well for problems of large sizes, while OpenCL has higher
overhead for smaller problems. Furthermore, we show that modern high-level
libraries allow effective use of the computational resources of many-core
GPUs or multi-core CPUs without much knowledge of the underlying technologies.
Comment: 21 pages, 4 figures, submitted to SIAM Journal of Scientific
Computing and accepte
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
FPGA vendors have recently started focusing on OpenCL for FPGAs because of
its ability to leverage the parallelism inherent to heterogeneous computing
platforms. OpenCL allows programs running on a host computer to launch
accelerator kernels which can be compiled at run-time for a specific
architecture, thus enabling portability. However, the prohibitive compilation
times (specifically the FPGA place and route times) are a major stumbling block
when using OpenCL tools from FPGA vendors. The long compilation times mean that
the tools cannot effectively use just-in-time (JIT) compilation or runtime
performance scaling. Coarse-grained overlays represent a possible solution by
virtue of their coarse granularity and fast compilation. In this paper, we
present a methodology for run-time compilation of OpenCL kernels to a DSP block
based coarse-grained overlay, rather than directly to the fine-grained FPGA
fabric. The proposed methodology allows JIT compilation and on-demand
resource-aware kernel replication to better utilize available overlay
resources, raising the abstraction level while reducing compile times
significantly. We further demonstrate that this approach can even be used for
run-time compilation of OpenCL kernels on the ARM processor of the embedded
heterogeneous Zynq device.
Comment: Presented at 3rd International Workshop on Overlay Architectures for
FPGAs (OLAF 2017) arXiv:1704.0880
GPU Scripting and Code Generation with PyCUDA
High-level scripting languages are in many ways polar opposites to GPUs. GPUs
are highly parallel, subject to hardware subtleties, and designed for maximum
throughput, and they offer a tremendous advance in the performance achievable
for a significant number of computational problems. On the other hand,
scripting languages such as Python favor ease of use over computational speed
and do not generally emphasize parallelism. PyCUDA is a package that attempts
to join the two together. This chapter argues that in doing so, a programming
environment is created that is greater than just the sum of its two parts.
We would like to note that nearly all of this chapter applies in unmodified
form to PyOpenCL, a sister project of PyCUDA, whose goal is to realize the
same concepts as PyCUDA for OpenCL.
Strategy Preserving Compilation for Parallel Functional Code
Graphics Processing Units (GPUs) and other parallel devices are widely
available and have the potential for accelerating a wide class of algorithms.
However, expert programming skills are required to achieve maximum
performance. These devices expose low-level hardware details through imperative
programming interfaces where programmers explicitly encode device-specific
optimisation strategies. This inevitably results in non-performance-portable
programs delivering suboptimal performance on other devices.
Functional programming models have recently seen a renaissance in the systems
community as they offer possible solutions for tackling the performance
portability challenge. Recent work has shown how to automatically choose
high-performance parallelisation strategies for a wide range of hardware
architectures encoded in a functional representation. However, the translation
of such functional representations to the imperative program expected by the
hardware interface is typically performed ad hoc with no correctness guarantees
and no guarantees to preserve the intended parallelisation strategy.
In this paper, we present a formalised strategy-preserving translation from
high-level functional code to low-level data race free parallel imperative
code. This translation is formulated and proved correct within a language we
call Data Parallel Idealised Algol (DPIA), a dialect of Reynolds' Idealised
Algol. Performance results on GPUs and a multicore CPU show that the formalised
translation process generates low-level code with performance on a par with
code generated from ad hoc approaches.
ClangJIT: Enhancing C++ with Just-in-Time Compilation
The C++ programming language is not only a keystone of the
high-performance-computing ecosystem but has proven to be a successful base for
portable parallel-programming frameworks. As is well known, C++ programmers use
templates to specialize algorithms, thus allowing the compiler to generate
highly-efficient code for specific parameters, data structures, and so on. This
capability has been limited to those specializations that can be identified
when the application is compiled, and in many critical cases, compiling all
potentially-relevant specializations is not practical. ClangJIT provides a
well-integrated C++ language extension allowing template-based specialization
to occur during program execution. This capability has been implemented for use
in large-scale applications, and we demonstrate that
just-in-time-compilation-based dynamic specialization can be integrated into
applications, often requiring minimal changes (or no changes) to the
applications themselves, providing significant performance improvements,
programmer-productivity improvements, and decreased compilation time.