346 research outputs found
CLBlast: A Tuned OpenCL BLAS Library
This work introduces CLBlast, an open-source BLAS library providing optimized
OpenCL routines to accelerate dense linear algebra for a wide variety of
devices. It is targeted at machine learning and HPC applications and thus
provides a fast matrix-multiplication routine (GEMM) to accelerate the core of
many applications (e.g. deep learning, iterative solvers, astrophysics,
computational fluid dynamics, quantum chemistry). CLBlast has five main
advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested
on a large variety of OpenCL devices including less commonly used devices such
as embedded and low-power GPUs, 2) it can be explicitly tuned for specific
problem-sizes on specific hardware platforms, 3) it can perform operations in
half-precision floating-point FP16 saving bandwidth, time and energy, 4) it has
an optional CUDA back-end, 5) and it can combine multiple operations in a
single batched routine, accelerating smaller problems significantly. This paper
describes the library and demonstrates the advantages of CLBlast experimentally
for different use-cases on a wide variety of OpenCL hardware.Comment: Conference paper in: IWOCL '18, the International Workshop on OpenC
Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability
Heterogeneous computing, which combines devices with different architectures,
is rising in popularity, and promises increased performance combined with
reduced energy consumption. OpenCL has been proposed as a standard for
programing such systems, and offers functional portability. It does, however,
suffer from poor performance portability, code tuned for one device must be
re-tuned to achieve good performance on another device. In this paper, we use
machine learning-based auto-tuning to address this problem. Benchmarks are run
on a random subset of the entire tuning parameter configuration space, and the
results are used to build an artificial neural network based model. The model
can then be used to find interesting parts of the parameter space for further
search. We evaluate our method with different benchmarks, on several devices,
including an Intel i7 3770 CPU, an Nvidia K40 GPU and an AMD Radeon HD 7970
GPU. Our model achieves a mean relative error as low as 6.1%, and is able to
find configurations as little as 1.3% worse than the global minimum.Comment: This is a pre-print version an article to be published in the
Proceedings of the 2015 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW). For personal use onl
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
- …