PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation
High-performance computing has recently seen a surge of interest in
heterogeneous systems, with an emphasis on modern Graphics Processing Units
(GPUs). These devices offer tremendous potential for performance and efficiency
in important large-scale applications of computational science. However,
exploiting this potential can be challenging, as one must adapt to the
specialized and rapidly evolving computing environment currently exhibited by
GPUs. One way of addressing this challenge is to embrace better techniques and
to develop tools tailored to this environment. This article presents one simple
technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL,
two open-source toolkits that support this technique.
In introducing PyCUDA and PyOpenCL, this article proposes the combination of
a dynamic, high-level scripting language with the massive performance of a GPU
as a compelling two-tiered computing platform, potentially offering significant
performance and productivity advantages over conventional single-tier, static
systems. The concept of RTCG is simple and easily implemented using existing,
robust infrastructure. Nonetheless it is powerful enough to support (and
encourage) the creation of custom application-specific tools by its users. The
premise of the paper is illustrated by a wide range of examples where the
technique has been applied with considerable success.
Comment: Submitted to Parallel Computing, Elsevier
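As a rough illustration of the RTCG idea, the following sketch (assuming a working PyCUDA installation; the kernel and the scaling constant are invented for this example, not taken from the paper) assembles CUDA source from a Python string template at run time, compiles it with PyCUDA's SourceModule, and launches it:

```python
import numpy as np
import pycuda.autoinit               # creates a CUDA context on import
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Run-time code generation: the kernel source is built from a Python template,
# so problem-specific constants are baked in before JIT compilation.
kernel_template = """
__global__ void scale(float *dest, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dest[i] = %(FACTOR)ff * src[i];
}
"""

mod = SourceModule(kernel_template % {"FACTOR": 2.5})
scale = mod.get_function("scale")

src = np.random.randn(1 << 20).astype(np.float32)
dest = np.empty_like(src)
scale(drv.Out(dest), drv.In(src), np.int32(src.size),
      block=(256, 1, 1), grid=((src.size + 255) // 256, 1))
assert np.allclose(dest, 2.5 * src)
```

Because the source string is assembled in Python, constants, loop bounds, or data types can be chosen per problem instance before compilation, which is the essence of the RTCG technique the article describes.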
Hot-Rodding the Browser Engine: Automatic Configuration of JavaScript Compilers
Modern software systems in many application areas offer to the user a
multitude of parameters, switches and other customisation hooks. Humans tend to
have difficulties determining the best configurations for particular
applications. Modern optimising compilers are an example of such software
systems; their many parameters need to be tuned for optimal performance, but
are often left at the default values for convenience. In this work, we
automatically determine compiler parameter settings that result in optimised
performance for particular applications. Specifically, we apply a
state-of-the-art automated parameter configuration procedure based on
cutting-edge machine learning and optimisation techniques to two prominent
JavaScript compilers and demonstrate that significant performance improvements,
more than 35% in some cases, can be achieved over the default parameter
settings on a diverse set of benchmarks.
Comment: 11 pages, long version of a poster presented at CGO 201
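The tuning loop behind such results can be pictured roughly as follows. This is a minimal random-search sketch in Python; the engine binary, flag names, and benchmark script are hypothetical placeholders, and the paper applies a far more sophisticated configuration procedure than the plain random sampler shown here:

```python
import random
import subprocess
import time

# Hypothetical tunable parameters of a JIT compiler; real JavaScript engines
# expose different flags, so treat these names purely as placeholders.
PARAM_SPACE = {
    "--inline-threshold": [25, 50, 100, 200],
    "--unroll-limit": [0, 4, 8, 16],
    "--gc-heap-mb": [64, 128, 256],
}

def run_benchmark(flags):
    """Run a benchmark under the given flag settings and return wall time."""
    cmd = ["js-engine"] + [f"{k}={v}" for k, v in flags.items()] + ["bench.js"]
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

def random_search(budget=50):
    """Simple random search; a model-based configurator replaces the sampler."""
    best_flags, best_time = None, float("inf")
    for _ in range(budget):
        flags = {k: random.choice(v) for k, v in PARAM_SPACE.items()}
        t = run_benchmark(flags)
        if t < best_time:
            best_flags, best_time = flags, t
    return best_flags, best_time
```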
CoCoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way
Assuming hardware is the major constraint for enabling real-time mobile
intelligence, the industry has mainly dedicated its efforts to developing
specialized hardware accelerators for machine learning and inference. This
article challenges the assumption. By drawing on a recent real-time AI
optimization framework CoCoPIE, it maintains that with effective
compression-compiler co-design, it is possible to enable real-time artificial
intelligence on mainstream end devices without special hardware. CoCoPIE is a
software framework that holds numerous records on mobile AI: the first
framework that supports all main kinds of DNNs, from CNNs to RNNs, transformers,
language models, and so on; the fastest DNN pruning and acceleration framework,
up to 180X faster than current DNN pruning on other frameworks such as
TensorFlow-Lite; making many representative AI applications able to run in
real time on off-the-shelf mobile devices, which was previously regarded as
possible only with special hardware support; and making off-the-shelf mobile
devices outperform a number of representative ASIC and FPGA solutions in terms
of energy efficiency and/or performance.
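As a point of reference for what DNN pruning means here, the following is a generic magnitude-pruning sketch in NumPy; it is illustrative only and does not reflect CoCoPIE's pattern-based pruning or its compiler-side code generation:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude weights (generic unstructured pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]    # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Example: prune a random 256x256 layer to roughly 80% sparsity.
w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = magnitude_prune(w)
print("sparsity:", 1.0 - mask.mean())
```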
GPU Scripting and Code Generation with PyCUDA
High-level scripting languages are in many ways polar opposites to GPUs. GPUs
are highly parallel, subject to hardware subtleties, and designed for maximum
throughput, and they offer a tremendous advance in the performance achievable
for a significant number of computational problems. On the other hand,
scripting languages such as Python favor ease of use over computational speed
and do not generally emphasize parallelism. PyCUDA is a package that attempts
to join the two together. This chapter argues that in doing so, a programming
environment is created that is greater than just the sum of its two parts.
We would like to note that nearly all of this chapter applies in unmodified
form to PyOpenCL, a sister project of PyCUDA whose goal is to realize the same
concepts for OpenCL.
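A small sketch of what this combination looks like in practice (assuming a working PyCUDA installation; the linear-combination kernel is an invented example rather than one from the chapter): GPU arrays behave much like NumPy arrays, while the element-wise kernel is compiled from a short C snippet at run time.

```python
import numpy as np
import pycuda.autoinit                    # set up a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# PyCUDA compiles this C snippet on the fly and handles memory management
# and argument marshalling from the Python side.
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a * x[i] + b * y[i]",
    "lin_comb")

x = gpuarray.to_gpu(np.random.randn(1 << 20).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(1 << 20).astype(np.float32))
z = gpuarray.empty_like(x)

lin_comb(np.float32(2.0), x, np.float32(3.0), y, z)
assert np.allclose(z.get(), 2.0 * x.get() + 3.0 * y.get())
```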
Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
We describe a set of lower-level abstractions to improve performance on
modern large scale heterogeneous systems. These provide portable access to
system- and hardware-dependent features, automatically apply dynamic
optimizations at run time, and target stencil-based codes used in finite
differencing, finite volume, or block-structured adaptive mesh refinement
codes.
These abstractions include a novel data structure to manage refinement
information for block-structured adaptive mesh refinement, an iterator
mechanism to efficiently traverse multi-dimensional arrays in stencil-based
codes, and a portable API and implementation for explicit SIMD vectorization.
These abstractions can either be employed manually, or be targeted by
automated code generation, or be used via support libraries by compilers during
code generation. The implementations described below are available in the
Cactus framework, and are used, for example, in the Einstein Toolkit for
relativistic astrophysics simulations.
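The abstractions themselves are C++ components inside Cactus; as a language-neutral illustration of the kind of stencil update they target, here is a small NumPy sketch of an explicit 5-point finite-difference step (an invented example, not the Cactus API):

```python
import numpy as np

def heat_step(u, dt=0.1, dx=1.0):
    """One explicit heat-equation update with a 5-point Laplacian stencil,
    the kind of finite-difference kernel such abstractions are built for."""
    u_new = u.copy()
    u_new[1:-1, 1:-1] = u[1:-1, 1:-1] + dt / dx**2 * (
        u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
        - 4.0 * u[1:-1, 1:-1])
    return u_new

u = np.zeros((64, 64))
u[32, 32] = 1.0                 # point source
for _ in range(100):
    u = heat_step(u)
```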
Gauge Field Generation on Large-Scale GPU-Enabled Systems
Over the past years, GPUs have been successfully applied to the task of
inverting the fermion matrix in lattice QCD calculations. Even strong scaling
to capability-level supercomputers, corresponding to O(100) GPUs or more, has
been achieved. However, strong scaling a whole gauge field generation algorithm
to this regime requires significantly more functionality than just having the
matrix inverter utilize the GPUs, and has not yet been accomplished. This
contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled
parallel systems, to help strong-scale the whole Hybrid Monte Carlo algorithm
to this regime. Initial results are shown for gauge field generation with
Chroma simulating pure Wilson fermions on OLCF TitanDev.
Comment: The 30th International Symposium on Lattice Field Theory, June 24-29,
2012, Cairns, Australia (Acknowledgment and Citation added)
MP-STREAM: A Memory Performance Benchmark for Design Space Exploration on Heterogeneous HPC Devices
Sustained memory throughput is a key determinant of performance in HPC devices.
Having an accurate estimate of this parameter is essential for manual or
automated design-space exploration for any HPC device. While there are
benchmarks for measuring the sustained memory bandwidth of CPUs and GPUs, such
a benchmark for FPGAs has been missing. We present MP-STREAM, an OpenCL-based
synthetic micro-benchmark for measuring sustained memory bandwidth, optimized
for FPGAs but usable on multiple platforms. Our main contribution is the
introduction of various generic as well as device-specific parameters that can
be tuned to measure their effect on memory bandwidth. We present results of
running our benchmark on a CPU, a GPU, and two FPGA targets, and discuss our
observations. The experiments underline the utility of our benchmark for
optimizing HPC applications for FPGAs, and provide valuable optimization hints
for FPGA programmers.
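MP-STREAM itself is an OpenCL benchmark; to illustrate what "sustained memory bandwidth" means and how tunable parameters enter the measurement, here is a minimal host-side STREAM-style copy measurement in Python (the array size and repeat count stand in for the benchmark's generic tunable parameters; all names are invented for this sketch):

```python
import time
import numpy as np

def stream_copy_bandwidth(n=1 << 26, repeats=10):
    """Measure sustained copy bandwidth in GB/s, STREAM-style.
    n and repeats play the role of generic tunable parameters."""
    src = np.random.rand(n)              # 8-byte doubles
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)              # one read + one write per element
        best = min(best, time.perf_counter() - t0)
    bytes_moved = 2 * n * src.itemsize
    return bytes_moved / best / 1e9

print(f"sustained copy bandwidth: {stream_copy_bandwidth():.1f} GB/s")
```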
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems,
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient, high-performance computing,
such potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review compiler techniques for
improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
different research areas. We conclude with a discussion on open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted to be published at CCF Transactions on High Performance
Computing
Practical Implementation of Lattice QCD Simulation on SIMD Machines with Intel AVX-512
We investigate the implementation of lattice Quantum Chromodynamics (QCD) code
on the Intel AVX-512 architecture. The most time-consuming part of numerical
simulations of lattice QCD is the solution of a linear equation with a large
sparse matrix that represents the strong interaction among quarks. To establish
widely applicable prescriptions, we examine rather general methods for the SIMD
architecture of AVX-512, such as using intrinsics and manual prefetching, for
the matrix multiplication. Based on experience with the Oakforest-PACS system,
a large-scale cluster composed of Intel Xeon Phi Knights Landing processors, we
discuss performance tuning exploiting AVX-512 and code design for the SIMD
architecture and massively parallel machines. We observe that the same code
runs efficiently on an Intel Xeon Skylake-SP machine.
Comment: 17 pages, 9 figures, talk given by I.K. at the Workshop Large Scale
Computational Physics (LSCP 2018) in the 18th International Conference on
Computational Science and its Applications (ICCSA 2018), 2-5 July 2018,
Melbourne. arXiv admin note: text overlap with arXiv:1712.0150
A Scala Prototype to Generate Multigrid Solver Implementations for Different Problems and Target Multi-Core Platforms
Many problems in computational science and engineering involve partial
differential equations and thus require the numerical solution of large, sparse
(non)linear systems of equations. Multigrid is known to be one of the most
efficient methods for this purpose. However, the concrete multigrid algorithm
and its implementation highly depend on the underlying problem and hardware.
Therefore, changes in the code or many different variants are necessary to
cover all relevant cases. In this article we provide a prototype implementation
in Scala for a framework that allows abstract descriptions of PDEs, their
discretization, and their numerical solution via multigrid algorithms. From
these, one is able to generate data structures and implementations of multigrid
components required to solve elliptic PDEs on structured grids. Two different
test problems showcase our proposed automatic generation of multigrid solvers
for both CPU and GPU target platforms.
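To make the generated "multigrid components" concrete, here is a minimal Python sketch of a two-grid cycle for the 1D Poisson problem, showing the smoother, residual, restriction, coarse solve, and prolongation; it is a generic illustration, not code produced by the Scala framework described in the article:

```python
import numpy as np

def jacobi(u, f, h, iters=3, omega=2.0 / 3.0):
    """Weighted Jacobi smoother for -u'' = f with zero Dirichlet boundaries."""
    for _ in range(iters):
        u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (
            u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def coarse_solve(rc, hc):
    """Direct solve of the coarse error equation (a real solver would recurse)."""
    m = rc.size - 2
    A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / hc**2
    ec = np.zeros_like(rc)
    ec[1:-1] = np.linalg.solve(A, rc[1:-1])
    return ec

def two_grid_cycle(u, f, h):
    """Pre-smooth, restrict the residual, coarse-grid correction, post-smooth."""
    u = jacobi(u, f, h)
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / h**2         # residual
    rc = np.zeros((r.size + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-3:2] + 0.5 * r[2:-2:2] + 0.25 * r[3:-1:2]  # full weighting
    ec = coarse_solve(rc, 2 * h)
    u += np.interp(np.arange(u.size), np.arange(0, u.size, 2), ec)    # prolongation
    return jacobi(u, f, h)

n, h = 129, 1.0 / 128
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)        # exact solution: sin(pi * x)
u = np.zeros(n)
for _ in range(10):
    u = two_grid_cycle(u, f, h)
print("max error:", np.max(np.abs(u - np.sin(np.pi * x))))
```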