8,145 research outputs found
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) become necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and o er some suggested areas for language development and integration
between coarse-grained and ne-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial di erential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scienti c applications developers
FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels
Using FPGAs as hardware accelerators that communicate with a central CPU is becoming a common practice in the embedded design world but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphical Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach
Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification
Modern Systems-on-Chip (SoC) designs are increasingly heterogeneous and
contain specialized semi-programmable accelerators in addition to programmable
processors. In contrast to the pre-accelerator era, when the ISA played an
important role in verification by enabling a clean separation of concerns
between software and hardware, verification of these "accelerator-rich" SoCs
presents new challenges. From the perspective of hardware designers, there is a
lack of a common framework for the formal functional specification of
accelerator behavior. From the perspective of software developers, there exists
no unified framework for reasoning about software/hardware interactions of
programs that interact with accelerators. This paper addresses these challenges
by providing a formal specification and high-level abstraction for accelerator
functional behavior. It formalizes the concept of an Instruction Level
Abstraction (ILA), developed informally in our previous work, and shows its
application in modeling and verification of accelerators. This formal ILA
extends the familiar notion of instructions to accelerators and provides a
uniform, modular, and hierarchical abstraction for modeling software-visible
behavior of both accelerators and programmable processors. We demonstrate the
applicability of the ILA through several case studies of accelerators (for
image processing, machine learning, and cryptography), and a general-purpose
processor (RISC-V). We show how the ILA model facilitates equivalence checking
between two ILAs, and between an ILA and its hardware finite-state machine
(FSM) implementation. Further, this equivalence checking supports accelerator
upgrades using the notion of ILA compatibility, similar to processor upgrades
using ISA compatibility.Comment: 24 pages, 3 figures, 3 table
An experimental study on performance portability of OpenCL kernels
Accelerator processors allow energy-efficient computation at high performance, especially for computationintensive applications. There exists a plethora of different accelerator architectures, such as GPUs and the Cell Broadband Engine. Each accelerator has its own programming language, but the recently introduced OpenCL language unifies accelerator programming languages. Hereby, OpenCL achieves functional protability, allowing to reduce the development time of kernels. Functional portability however has limited value without performance portability: the possibility to re-use optimized kernels with good performance. This paper investigates the specificity of code optimizations to accelerator architecture and the severity of lack of performance portability
- …