CLBlast: A Tuned OpenCL BLAS Library
This work introduces CLBlast, an open-source BLAS library providing optimized
OpenCL routines to accelerate dense linear algebra for a wide variety of
devices. It is targeted at machine learning and HPC applications and thus
provides a fast matrix-multiplication routine (GEMM) to accelerate the core of
many applications (e.g. deep learning, iterative solvers, astrophysics,
computational fluid dynamics, quantum chemistry). CLBlast has five main
advantages over other OpenCL BLAS libraries: 1) it is optimized for and tested
on a large variety of OpenCL devices, including less commonly used devices such
as embedded and low-power GPUs, 2) it can be explicitly tuned for specific
problem sizes on specific hardware platforms, 3) it can perform operations in
half-precision floating-point (FP16), saving bandwidth, time, and energy, 4) it
has an optional CUDA back-end, and 5) it can combine multiple operations in a
single batched routine, significantly accelerating smaller problems. This paper
describes the library and demonstrates the advantages of CLBlast experimentally
for different use-cases on a wide variety of OpenCL hardware.
Comment: Conference paper in IWOCL '18, the International Workshop on OpenCL
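As a sketch of what the GEMM routine at the core of these workloads computes (C ← αAB + βC), a naive reference version can be written as follows; this is purely illustrative and is not CLBlast's tuned OpenCL implementation:

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: C <- alpha * A @ B + beta * C.

    A is n x k, B is k x m, C is n x m (lists of lists of floats).
    """
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Tuned libraries such as CLBlast restructure this triple loop with tiling, vectorization, and local-memory blocking chosen per device, which is why per-device tuning matters.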
Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN
Over the past few years, machine learning has seen a renewed explosion of
interest, following a number of studies showing the effectiveness of neural
networks in a range of tasks that had previously been considered incredibly
hard. Neural networks' effectiveness in the fields of image recognition and
natural language processing stems primarily from the vast amounts of data
available to companies and researchers, coupled with the huge amounts of
compute power available in modern accelerators such as GPUs, FPGAs and ASICs.
Developers can target GPGPU hardware through several technologies, such as
SYCL, OpenCL, and CUDA; however, many applications require the same low-level
mathematical routines. Libraries dedicated to accelerating these common
routines allow developers to make full use of the available hardware without
requiring low-level knowledge of the hardware themselves. However, such
libraries are typically provided by hardware manufacturers for specific
hardware, such as cuDNN for Nvidia hardware or MIOpen for AMD hardware.
SYCL-DNN is a new open-source library dedicated to providing accelerated
routines for neural network operations which are hardware and vendor agnostic.
Built on top of the SYCL open standard and written entirely in standard C++,
SYCL-DNN allows a user to easily accelerate neural network code for a wide
range of hardware using a modern C++ interface. The library is tested on AMD's
OpenCL for GPU, Intel's OpenCL for CPU and GPU, ARM's OpenCL for Mali GPUs as
well as ComputeAorta's OpenCL for R-Car CV engine and host CPU. In this talk we
will present performance figures for SYCL-DNN on this range of hardware, and
discuss how high performance was achieved on such a varied set of accelerators
with such different hardware features.
Comment: 4 pages, 3 figures. In International Workshop on OpenCL (IWOCL '19),
May 13-15, 2019, Boston
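To ground the kind of neural-network primitive a library like SYCL-DNN accelerates, here is a minimal single-channel 2D convolution in plain Python; the function name and "valid" padding choice are illustrative assumptions, not SYCL-DNN's API:

```python
def conv2d(x, w):
    """Single-channel 'valid' 2D convolution (cross-correlation form).

    x is an H x W input, w is a kh x kw filter; output is
    (H - kh + 1) x (W - kw + 1).
    """
    H, W = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += x[i + di][j + dj] * w[di][dj]
            row.append(s)
        out.append(row)
    return out
```

Production libraries lower this same computation to vendor-tuned kernels (im2col + GEMM, Winograd, or direct tiled convolution) selected per device.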
HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC
This paper presents HALO 1.0, an open-ended extensible multi-agent software
framework that implements a set of proposed hardware-agnostic accelerator
orchestration (HALO) principles. HALO implements a novel compute-centric
message passing interface (C^2MPI) specification for enabling the
performance-portable execution of a hardware-agnostic host application across
heterogeneous accelerators. Experimental results from evaluating eight widely
used HPC subroutines on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs, and
NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 provides a unified control
flow for host programs across all the computing devices, with a consistently
top performance portability score that is up to five orders of magnitude higher
than that of the OpenCL-based solution.
Comment: 21 pages
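Performance portability scores of this kind are commonly computed as the harmonic mean of per-platform architectural efficiencies (the Pennycook metric). Assuming that definition (the abstract does not state which metric is used), a minimal sketch is:

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies in (0, 1].

    efficiencies: dict mapping platform name -> efficiency.
    Returns 0.0 if any platform is unsupported (efficiency 0),
    following the Pennycook et al. convention.
    """
    if any(e == 0 for e in efficiencies.values()):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies.values())
```

The harmonic mean heavily penalizes a single poorly performing platform, which is why unified frameworks that keep efficiency high everywhere score orders of magnitude better than ports that collapse on one device.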
FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing
In this paper, we present FLASH 1.0, a C++-based software framework for rapid
parallel deployment and enhancing host code portability in heterogeneous
computing. FLASH takes a novel approach in describing kernels and dynamically
dispatching them in a hardware-agnostic manner. FLASH features truly
hardware-agnostic frontend interfaces, which not only unify the compile-time
control flow but also enforce a portability-optimized code organization that
imposes a demarcation between computational (performance-critical) and
functional (non-performance-critical) code, as well as the separation of
hardware-specific and hardware-agnostic code in the host application. We use
static code analysis to measure the hardware independence ratio of popular HPC
applications and show that up to 99.72% code portability can be achieved with
FLASH. Similarly, we measure the complexity of state-of-the-art portable
programming models and show that a code reduction of up to 2.2x can be achieved
for two common HPC kernels while maintaining 100% code portability with a
normalized framework overhead between 1% - 13% of the total kernel runtime. The
codes are available at https://github.com/PSCLab-ASU/FLASH.
Comment: 12 pages
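The general idea of describing a kernel once and dispatching it to a backend at runtime can be sketched with a simple registry; the names `register` and `dispatch` below are hypothetical illustrations and do not reflect FLASH's actual interfaces:

```python
# Registry mapping (kernel name, backend name) -> implementation.
registry = {}

def register(kernel, backend):
    """Decorator that records an implementation for a kernel/backend pair."""
    def deco(fn):
        registry[(kernel, backend)] = fn
        return fn
    return deco

@register("saxpy", "cpu")
def saxpy_cpu(a, x, y):
    # Hardware-specific body; only this function would change per backend.
    return [a * xi + yi for xi, yi in zip(x, y)]

def dispatch(kernel, backend, *args):
    """Hardware-agnostic entry point: the host calls this, never a backend."""
    return registry[(kernel, backend)](*args)
```

The host application only ever calls `dispatch`, which is the separation of hardware-agnostic and hardware-specific code that the framework's static analysis measures.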
Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs
The heterogeneous computing paradigm has led to the need for portable and
efficient programming solutions that can leverage the capabilities of various
hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the
portability and performance of the SYCL and CUDA languages for one fundamental
bioinformatics application (Smith-Waterman protein database search) across
different GPU architectures, considering single and multi-GPU configurations
from different vendors. The experimental work showed that, while both CUDA and
SYCL versions achieve similar performance on NVIDIA devices, the latter
demonstrated remarkable code portability to other GPU architectures, such as
AMD and Intel. Furthermore, the architectural efficiency rates achieved on
these devices were superior in 3 of the 4 cases tested. This brief study
highlights the potential of SYCL as a viable solution for achieving both
performance and portability in the heterogeneous computing ecosystem.
Comment: This article was accepted for publication in 2023 IEEE 35th
International Symposium on Computer Architecture and High Performance
Computing (SBAC-PAD)
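For reference, the Smith-Waterman local-alignment recurrence that both the CUDA and SYCL implementations parallelize can be sketched as a plain dynamic program; the scoring parameters here are illustrative defaults, not the paper's:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    H[i][j] is the best score of an alignment ending at a[i-1], b[j-1];
    the max with 0 allows alignments to restart anywhere (local alignment).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # extend diagonally
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

The anti-diagonal data dependence of `H` is what GPU implementations exploit: all cells on one anti-diagonal can be computed in parallel, which maps naturally onto both CUDA and SYCL work-groups.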