644 research outputs found
An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation
Developing performance-portable molecular dynamics kernels in Open CL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia’s miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD’s short-range force calculation kernel are the same across these architectures, and detail a number of platform- agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Heterogeneous computing, which combines devices with different architectures,
is rising in popularity, and promises increased performance combined with
reduced energy consumption. OpenCL has been proposed as a standard for
programing such systems, and offers functional portability. It does, however,
suffer from poor performance portability, code tuned for one device must be
re-tuned to achieve good performance on another device. In this paper, we use
machine learning-based auto-tuning to address this problem. Benchmarks are run
on a random subset of the entire tuning parameter configuration space, and the
results are used to build an artificial neural network based model. The model
can then be used to find interesting parts of the parameter space for further
search. We evaluate our method with different benchmarks, on several devices,
including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970
GPU. Our model achieves a mean relative error as low as 6.1%, and is able to
find configurations as little as 1.3% worse than the global minimum.Comment: This is a pre-print version an article to be published in the
Proceedings of the 2015 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW). For personal use onl
Towards a portable and future-proof particle-in-cell plasma physics code
We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes.
We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D’s current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures
Performance Portability Study of Linear Algebra Kernels in OpenCL
The performance portability of OpenCL kernel implementations for common
memory bandwidth limited linear algebra operations across different hardware
generations of the same vendor as well as across vendors is studied. Certain
combinations of kernel implementations and work sizes are found to exhibit good
performance across compute kernels, hardware generations, and, to a lesser
degree, vendors. As a consequence, it is demonstrated that the optimization of
a single kernel is often sufficient to obtain good performance for a large
class of more complicated operations.Comment: 11 pages, 8 figures, 2 tables, International Workshop on OpenCL 201
- …