675 research outputs found
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications
using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas: partial differential equations; graph cluster
metric calculations; and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
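The hybrid pattern this abstract describes, one host thread per accelerator with each thread driving its own device, can be sketched as follows. This is only an illustration under stated assumptions: `run_on_device` and `hybrid_map` are hypothetical names, and the "device" is simulated with plain Python, since the abstract does not give the authors' code.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_device(device_id, chunk):
    """Hypothetical stand-in for selecting device `device_id` and launching
    a data-parallel kernel on it; here the 'kernel' just squares each value."""
    return [x * x for x in chunk]


def hybrid_map(data, num_devices):
    """Coarse-grained parallelism on the host: one thread per (simulated)
    device, each thread handling one contiguous chunk of the data."""
    size = (len(data) + num_devices - 1) // num_devices  # ceiling division
    chunks = [data[i * size:(i + 1) * size] for i in range(num_devices)]
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [pool.submit(run_on_device, dev, chunk)
                   for dev, chunk in enumerate(chunks)]
        # Collect in submission order so the output order matches the input.
        parts = [f.result() for f in futures]
    return [value for part in parts for value in part]


if __name__ == "__main__":
    print(hybrid_map(list(range(10)), num_devices=3))
```

In a real hybrid code each worker thread would bind to its device once and then issue kernels to it independently; the chunking and result gathering on the host would look much the same.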
Performance analysis and optimization of automotive GPUs
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) have drastically increased the performance demands of automotive systems. Suitable high-performance platforms building upon Graphics Processing Units (GPUs) have been developed to respond to this demand, with the NVIDIA Jetson TX2 being a relevant representative. However, whether high-performance GPU configurations are appropriate for automotive setups remains an open question. This paper aims at shedding light on this question by modelling an automotive GPU (Jetson TX2), analyzing its microarchitectural parameters against relevant benchmarks, and identifying specific configurations able to meaningfully increase performance within similar cost envelopes, or to decrease costs while preserving original performance levels. Overall, our analysis opens the door to the optimization of automotive GPUs for further system efficiency.
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P, the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772773) and the HiPEAC Network of Excellence. Pedro Benedicte and Jaume Abella have been partially supported by the MINECO under FPU15/01394 grant and Ramón y Cajal postdoctoral fellowship number RYC-2013-14717 respectively, and Leonidas Kosmidis under Juan de la Cierva-Formación postdoctoral fellowship (FJCI-2017-34095).
Peer Reviewed. Postprint (author's final draft)
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well-known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform.
Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
General matrix-matrix multiplications with double-precision real and complex
entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized
for square matrices but often show bad performance for tall & skinny matrices,
which are much taller than wide. NVIDIA's current CUBLAS implementation
delivers only a fraction of the potential performance as indicated by the
roofline model in this case. We describe the challenges and key characteristics
of an implementation that can achieve close to optimal performance. We further
evaluate different strategies of parallelization and thread distribution, and
devise a flexible, configurable mapping scheme. To ensure flexibility and allow
for highly tailored implementations we use code generation combined with
autotuning. For a large range of matrix sizes in the domain of interest we
achieve at least 2/3 of the roofline performance and often substantially
outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
Comment: 12 pages, 22 figures. Extended version of arXiv:1905.03136v1 for journal submission
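As a rough illustration of the roofline bound mentioned in this abstract, the sketch below estimates the attainable performance of a tall & skinny product. It assumes a kernel of the form C = AᵀB with A of size K×M and B of size K×N (K much larger than M and N), double-precision real entries, and memory traffic dominated by reading A and B once; the abstract does not spell out these details, so they are assumptions. The function names and the hardware numbers in the usage lines are hypothetical placeholders, not measurements.

```python
def arithmetic_intensity(m, n, bytes_per_element=8):
    """Flops per byte for C = A^T B with tall & skinny A (K x M) and B (K x N).

    Per row of A and B: 2*m*n flops (multiply + add) and (m + n) elements
    read. K cancels out, and the small M x N result is neglected (assumption).
    """
    flops_per_row = 2.0 * m * n
    bytes_per_row = bytes_per_element * (m + n)
    return flops_per_row / bytes_per_row


def roofline(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFlop/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFlop/s peak, 100 GB/s memory bandwidth.
    limit = roofline(arithmetic_intensity(4, 4), 1000.0, 100.0)
    print(f"roofline limit: {limit} GFlop/s, 2/3 target: {2 * limit / 3:.1f}")
```

For narrow matrices the intensity is low, so the memory roof dominates; that is why the roofline limit, rather than peak floating-point throughput, is the natural yardstick for this kernel.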
Contract-Based General-Purpose GPU Programming
Using GPUs as general-purpose processors has revolutionized parallel
computing by offering, for a large and growing set of algorithms, massive
data-parallelization on desktop machines. An obstacle to widespread adoption,
however, is the difficulty of programming them and the low-level control of the
hardware required to achieve good performance. This paper suggests a
programming library, SafeGPU, that aims at striking a balance between
programmer productivity and performance, by making GPU data-parallel operations
accessible from within a classical object-oriented programming language. The
solution is integrated with the design-by-contract approach, which increases
confidence in functional program correctness by embedding executable program
specifications into the program text. We show that our library leads to modular
and maintainable code that is accessible to GPGPU non-experts, while providing
performance that is comparable with hand-written CUDA code. Furthermore,
runtime contract checking turns out to be feasible, as the contracts can be
executed on the GPU.
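The design-by-contract idea in this abstract, executable pre- and postconditions attached to a data-parallel operation, can be sketched in a few lines. This is not SafeGPU's API: the `contract` decorator, `ContractViolation`, and `vector_add` are hypothetical names used only for illustration, and the operation runs on the CPU here.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def contract(require=None, ensure=None):
    """Attach an executable precondition (`require`, called with the
    arguments) and postcondition (`ensure`, called with the result followed
    by the arguments) to a function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            if require is not None and not require(*args):
                raise ContractViolation(f"precondition of {fn.__name__} failed")
            result = fn(*args)
            if ensure is not None and not ensure(result, *args):
                raise ContractViolation(f"postcondition of {fn.__name__} failed")
            return result
        return wrapper
    return decorate


@contract(
    require=lambda xs, ys: len(xs) == len(ys),
    ensure=lambda out, xs, ys: len(out) == len(xs),
)
def vector_add(xs, ys):
    # Element-wise addition: the kind of operation a GPU library would offload.
    return [x + y for x, y in zip(xs, ys)]


if __name__ == "__main__":
    print(vector_add([1, 2], [3, 4]))
```

The point the abstract makes is that such checks are themselves data-parallel (for example, a length or element-wise comparison), so they can run on the same device as the operation they guard.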
Sparse matrix-vector multiplication on GPGPUs
The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high performance computing architectures. The introduction of General Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. With this paper we provide a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years. We discuss the issues and trade-offs that have been encountered by the various researchers, and present a list of solutions, organized in categories according to common features. We also provide a performance comparison across different GPGPU models and on a set of test matrices coming from various application domains.
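For readers unfamiliar with the kernel, a minimal reference version of SpMV is sketched below. It uses the compressed sparse row (CSR) layout, which is one common storage format and is chosen here as an assumption; the abstract does not say which formats the review covers. The function name `spmv_csr` is hypothetical, and the loop is sequential Python rather than a GPU kernel.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # On a GPU, each row (or slice of a row) would typically be handled
        # by its own thread or group of threads; here the rows run in turn.
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_idx[k]]
        y[row] = total
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0],
    #      [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

The indirect access `x[col_idx[k]]` and the varying row lengths are what make the kernel hard to map efficiently onto wide parallel hardware.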
A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems
Heterogeneous architectures are becoming more common in High Performance Computing
(HPC) systems. Even using tools like CUDA and OpenCL, it is a non-trivial task
to obtain optimal performance on the GPU. Approaches to simplifying this task
include Merge (a library based framework for heterogeneous multi-core systems),
Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a
new programming language for general purpose computation on the GPU) and
CUDA-lite (an enhancement to CUDA that transforms code based on annotations).
In addition, efforts are underway to improve compiler tools for automatic
parallelization and optimization of affine loop nests for GPUs and for
automatic translation of OpenMP parallelized codes to CUDA.
In this paper we present an alternative approach: a new computational
framework for the development of massively data parallel scientific
applications suitable for use on such petascale/exascale hybrid systems built
upon the highly scalable Cactus framework. As the first non-trivial
demonstration of its usefulness, we successfully developed a new 3D CFD code
that achieves improved performance.
Comment: Parallel Computing 2011 (ParCo2011), 30 August -- 2 September 2011, Ghent, Belgium