Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
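One classic transformation of the kind this toolbox covers is resolving a loop-carried dependency so the loop can be pipelined. The sketch below is a generic illustration in plain C++, not code from the paper itself: the HLS pragmas are shown as comments, and the interleaving factor K is an assumed placeholder that would be chosen to match the adder latency on a real target.

```cpp
#include <array>

// Naive reduction: the loop-carried dependency on acc forces each
// iteration to wait for the previous floating-point add, so an HLS tool
// cannot pipeline this loop with an initiation interval of 1.
float reduce_naive(const float* data, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)  // #pragma HLS PIPELINE would stall here
        acc += data[i];
    return acc;
}

// Transformed: interleave K partial accumulators so consecutive
// iterations write to independent registers, breaking the dependency
// chain and allowing the loop to pipeline; the partials are combined
// in a short epilogue loop.
float reduce_interleaved(const float* data, int n) {
    constexpr int K = 8;             // assumed >= adder pipeline latency
    std::array<float, K> partial{};  // #pragma HLS ARRAY_PARTITION complete
    for (int i = 0; i < n; ++i)      // #pragma HLS PIPELINE II=1
        partial[i % K] += data[i];
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += partial[k];
    return acc;
}
```

Both versions compute the same reduction; the transformed one trades a few extra registers and an epilogue for a fully pipelined main loop.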
Multi-GPU support on the marrow algorithmic skeleton framework
Dissertation for the degree of Master in Computer Science Engineering. With the proliferation of general-purpose GPUs, workload parallelization and data-transfer optimization have become an increasing concern. The natural evolution from using a single GPU is to multiply the number of available processors, which presents new challenges, such as tuning workload decompositions and load balancing when dealing with heterogeneous systems.
Higher-level programming is a very important asset in a multi-GPU environment, given the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), namely their low level of abstraction and code overhead. This can be addressed by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations,
such as transparent load-balancing mechanisms and reduced explicit code overhead.
Algorithmic Skeletons, previously used in cluster environments, have recently been
adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by
defining computation patterns of commonly used algorithms. The Marrow algorithmic
skeleton library is one of these, taking advantage of the abstractions to automate the
orchestration needed for an efficient GPU execution.
This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons
in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine.
We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single GPU.
Projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200
Design of OpenCL-compatible multithreaded hardware accelerators with dynamic support for embedded FPGAs
ARTICo3 is an architecture that allows dynamically instantiating an arbitrary number of reconfigurable hardware accelerators, each containing a given number of threads fixed at design time according to high-level synthesis constraints. The replication of these modules can be decided at runtime to accelerate kernels by increasing the overall number of threads, to add modular redundancy for greater fault tolerance, or any combination of the two. An execution scheduler is used at kernel invocation to deliver the appropriate data transfers, optimizing memory transactions, and to sequence or parallelize execution according to the configuration specified by the architecture's resource manager. The model of computation is compatible with the OpenCL kernel execution model, and memory transfers and the architecture are arranged to match the same optimization criteria as kernel execution on GPU architectures but, unlike other approaches, with dynamic hardware execution support.
In this paper, a novel design methodology for multithreaded hardware accelerators is presented. The proposed framework provides OpenCL compatibility by implementing a memory model based on shared memory between host and compute device, which removes the overhead imposed by data transfers at the global memory level, and local memories inside each accelerator (i.e., compute unit), which are connected to global memory through optimized DMA links. These local memories provide unified access (a continuous memory map) from the host side, but are divided into a configurable number of independent banks (to increase the available ports) from the processing-element side, to fully exploit data-level parallelism. Experimental results show OpenCL model compliance using multithreaded hardware accelerators, as well as enhanced dynamic adaptation capabilities.
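The idea of a unified memory map on the host side backed by independent banks on the processing-element side can be illustrated with a simple cyclic interleaving scheme. This is a hypothetical sketch, not ARTICo3's actual address decoder: the bank count and the modulo/divide mapping are assumptions chosen to show why consecutive words become accessible in parallel.

```cpp
#include <cstdint>
#include <utility>

// Configurable number of independent banks on the PE side (assumed
// power of two here for simplicity).
constexpr uint32_t BANKS = 4;

// Host view -> PE view: a word address in the continuous memory map is
// split into (bank, offset), so consecutive words land in different
// banks and can be read through different ports in the same cycle.
std::pair<uint32_t, uint32_t> to_banked(uint32_t word_addr) {
    uint32_t bank   = word_addr % BANKS;  // low bits select the bank
    uint32_t offset = word_addr / BANKS;  // remaining bits index within it
    return {bank, offset};
}

// PE view -> host view: inverse mapping back to the unified address.
uint32_t to_unified(uint32_t bank, uint32_t offset) {
    return offset * BANKS + bank;
}
```

With this mapping, a stride-1 access pattern from the host touches all BANKS banks in rotation, which is exactly what lets the processing elements exploit data-level parallelism.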
pocl: A Performance-Portable OpenCL Implementation
OpenCL is a standard for parallel programming of heterogeneous systems. The
benefits of a common programming standard are clear; multiple vendors can
provide support for application descriptions written according to the standard,
thus reducing the program porting effort. While the standard brings the obvious
benefits of platform portability, the performance portability aspects are
largely left to the programmer. The situation is made worse due to multiple
proprietary vendor implementations with different characteristics, and, thus,
required optimization strategies.
In this paper, we propose an OpenCL implementation that is both portable and
performance portable. At its core is a kernel compiler that can be used to
exploit the data parallelism of OpenCL programs on multiple platforms with
different parallel hardware styles. The kernel compiler is modularized to
perform target-independent parallel region formation separately from the
target-specific parallel mapping of the regions to enable support for various
styles of fine-grained parallel resources such as subword SIMD extensions, SIMD
datapaths and static multi-issue. Unlike previous similar techniques that work
on the source level, the parallel region formation retains the information of
the data parallelism using the LLVM IR and its metadata infrastructure. This
data can be exploited by the later generic compiler passes for efficient
parallelization.
The proposed open source implementation of OpenCL is also platform portable,
enabling OpenCL on a wide range of architectures, both already commercialized
and on those that are still under research. The paper describes how the
portability of the implementation is achieved. Our results show that most of
the benchmarked applications when compiled using pocl were faster or close to
as fast as the best proprietary OpenCL implementation for the platform at hand.
Comment: This article was published in 2015; it is now openly accessible via arXiv.
Techniques for Improving the Ease of Programming in OpenCL
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Jaejin Lee. OpenCL is one of the major programming models for heterogeneous systems. This thesis presents two limitations of OpenCL, the complicated nature of programming in OpenCL and the lack of support for heterogeneous clusters, and proposes a solution to each of them for ease of programming.
The first limitation is that it is complicated to write a program using OpenCL. In order to lower this programming complexity, this thesis proposes a framework that translates a program written in a high-level language (OpenMP) to OpenCL at the source level. This thesis achieves both ease of programming and high performance by employing two techniques: data transfer minimization (DTM) and performance portability enhancement (PPE). This thesis shows the effectiveness of the proposed translation framework by evaluating benchmark applications, and its practicality by comparing it with the commercial PGI compiler.
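The core idea behind data transfer minimization can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the thesis's implementation: the translator's runtime tracks where the latest valid copy of each buffer lives and skips host-to-device copies when the device copy is already up to date.

```cpp
// Where the most recent valid copy of a buffer lives.
enum class Loc { Host, Device, Both };

struct Buffer {
    Loc valid = Loc::Host;  // freshly allocated data starts on the host
    int transfers = 0;      // count of actual host->device copies issued
};

// Called before a kernel launch that reads the buffer: only copy if the
// device does not already hold the latest version.
void to_device(Buffer& b) {
    if (b.valid == Loc::Host) {
        ++b.transfers;       // the real runtime would issue the copy here
        b.valid = Loc::Both;
    }
}

// Called when host code writes the buffer: the device copy goes stale.
void host_writes(Buffer& b) {
    b.valid = Loc::Host;
}
```

Two consecutive kernel launches on an unchanged buffer then cost one transfer instead of two, which is exactly the redundancy DTM is meant to eliminate.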
The second limitation of OpenCL is the lack of support for a heterogeneous cluster. In order to extend OpenCL to a heterogeneous cluster, this thesis proposes a framework called SnuCL-D that is able to execute a program written only in OpenCL on a heterogeneous cluster. Unlike previous approaches that apply a centralized approach, the proposed framework applies a decentralized approach, which gives a chance to reduce three kinds of overhead occurring in the execution path of commands.
With the ability to analyze and reduce these three kinds of overhead, the proposed framework shows good scalability on a large-scale cluster system. The proposed framework proves its effectiveness and practicality when compared to the representative centralized approach (SnuCL) and to MPI, using benchmark applications.
This thesis proposes solutions for the two limitations of OpenCL for ease of programming on heterogeneous clusters. It is expected that application developers will be able to easily execute not only an OpenMP program on various accelerators but also a program written only in OpenCL on a heterogeneous cluster.
Chapter I. Introduction
I.1 Motivation and Objectives
I.1.1 Programming Complexity
I.1.2 Lack of Support for a Heterogeneous Cluster
I.2 Contributions
Chapter II. Background and Related Work
II.1 Background
II.1.1 OpenCL
II.1.2 OpenMP
II.2 Related Work
II.2.1 Programming Complexity
II.2.2 Support for a Heterogeneous Cluster
Chapter III. Lowering the Programming Complexity
III.1 Motivating Example
III.1.1 Device Constructs
III.1.2 Needs for Data Transfer Optimization
III.2 Mapping OpenMP to OpenCL
III.2.1 Architecture Model
III.2.2 Execution Model
III.3 Code Translation
III.3.1 Translation Process
III.3.2 Translating OpenMP to OpenCL
III.3.3 Example of Code Translation
III.3.4 Data Transfer Minimization (DTM)
III.3.5 Performance Portability Enhancement (PPE)
III.4 Performance Evaluation
III.4.1 Evaluation Methodology
III.4.2 Effectiveness of Optimization Techniques
III.4.3 Comparison with Other Implementations
Chapter IV. Support for a Heterogeneous Cluster
IV.1 Problems of Previous Approaches
IV.2 The Approach of SnuCL-D
IV.2.1 Overhead Analysis
IV.2.2 Remote Device Virtualization
IV.2.3 Redundant Computation and Data Replication
IV.2.4 Memory-read Commands
IV.3 Consistency Management
IV.4 Deterministic Command Scheduling
IV.5 New API Function: clAttachBufferToDevice()
IV.6 Queueing Optimization
IV.7 Performance Evaluation
IV.7.1 Evaluation Methodology
IV.7.2 Evaluation with a Microbenchmark
IV.7.3 Evaluation on the Large-scale CPU Cluster
IV.7.4 Evaluation on the Medium-scale GPU Cluster
Chapter V. Conclusion and Future Work
Bibliography
Korean Abstract
Python FPGA Programming with Data-Centric Multi-Level Design
Although high-level synthesis (HLS) tools have significantly improved
programmer productivity over hardware description languages, developing for
FPGAs remains tedious and error prone. Programmers must learn and implement a
large set of vendor-specific syntax, patterns, and tricks to optimize (or even
successfully compile) their applications, while dealing with ever-changing
toolflows from the FPGA vendors. We propose a new way to develop, optimize, and
compile FPGA programs. The Data-Centric parallel programming (DaCe) framework
allows applications to be defined by their dataflow and control flow through
the Stateful DataFlow multiGraph (SDFG) representation, capturing the abstract
program characteristics, and exposing a plethora of optimization opportunities.
In this work, we show how extending SDFGs with multi-level Library Nodes
incorporates both domain-specific and platform-specific optimizations into the
design flow, enabling knowledge transfer across application domains and FPGA
vendors. We present the HLS-based FPGA code generation backend of DaCe, and
show how SDFGs are code generated for either FPGA vendor, emitting efficient
HLS code that is structured and annotated to implement the desired
architecture.