CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators
such as GPUs are increasingly popular due to their high
peak performance, energy efficiency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
flexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations currently
lose one of OpenMP’s best features, its flexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically scheduling
tasks across heterogeneous resources, and propose straightforward
extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall flexibility of Accelerated OpenMP.
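The idea of dividing one loop across CPUs and several GPUs can be illustrated with a minimal sketch. This is not CoreTSAR's actual scheduler: the function name `split_iterations` and the throughput figures are hypothetical, and the sketch only assumes that each device has a measured throughput (iterations per second) from an earlier pass.

```python
def split_iterations(total, throughputs):
    """Divide `total` loop iterations among devices in proportion to throughput.

    `throughputs` holds one positive number per device (hypothetical
    measurements, e.g. iterations per second from a previous pass).
    """
    weight = sum(throughputs)
    if total < 0 or weight <= 0:
        raise ValueError("need a non-negative total and positive throughputs")
    shares, done, start = [], 0, 0
    for t in throughputs:
        done += t
        # Round the cumulative boundary so the shares always sum to `total`.
        end = round(total * done / weight)
        shares.append(end - start)
        start = end
    return shares


# One CPU socket, two GPUs, one slower CPU socket (assumed throughputs).
print(split_iterations(1000, [1.0, 4.0, 4.0, 1.0]))  # → [100, 400, 400, 100]
```

Re-measuring throughput after each pass and calling the function again would give a simple adaptive variant of the same idea.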
Accelerating Face Anti-Spoofing Algorithms
This thesis focuses on accelerating an algorithm from the field of face-based anti-spoofing, using graphics hardware as the platform for data-parallel processing. OpenCL serves as the framework: it runs on devices ranging from powerful desktop computers to hand-held devices, and on processing units such as GPUs, ASICs, and x86 CPUs, without being tied to a particular hardware vendor or operating system. The author presents an analysis and an accelerated implementation of a widely used algorithm, and assesses the impact of the resulting speedup.
Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor
In the construction of exascale computing systems, energy efficiency and power
consumption are two of the major challenges. Low-power high performance
embedded systems are of increasing interest as building blocks for large scale
high-performance systems. However, extracting maximum performance out of such
systems presents many challenges. Various aspects from the hardware
architecture to the programming models used need to be explored. The Epiphany
architecture integrates low-power RISC cores on a 2D mesh network and promises
up to 70 GFLOPS/Watt of processing efficiency. However, with just 32 KB of
memory per eCore for storing both data and code, and only low level inter-core
communication support, programming the Epiphany system presents several
challenges. In this paper we evaluate the performance of the Epiphany system
for a variety of basic compute and communication operations. Guided by this
data we explore strategies for implementing scientific applications on memory
constrained low-powered devices such as the Epiphany. With future systems
expected to house thousands of cores in a single chip, the merits of such
architectures as a path to exascale are compared with those of competing systems.
Comment: 14 pages, submitted to IJHPCA Journal special editio
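To make the 32 KB per-eCore constraint concrete, the following sketch estimates the largest square tile that fits in local memory for a tiled matrix multiplication. Only the 32 KB figure comes from the abstract. The 8 KB reserved for code and stack, the 4-byte elements, and the three resident tiles (two inputs, one output) are assumptions chosen for illustration, and `max_square_tile` is a hypothetical helper.

```python
import math


def max_square_tile(local_bytes=32 * 1024, reserved_bytes=8 * 1024,
                    elem_bytes=4, buffers=3):
    """Largest n such that `buffers` n-by-n tiles fit in the memory that
    remains after `reserved_bytes` is set aside for code and stack."""
    usable = local_bytes - reserved_bytes
    if usable <= 0:
        return 0
    # buffers * n * n * elem_bytes <= usable, solved for integer n.
    return math.isqrt(usable // (buffers * elem_bytes))


print(max_square_tile())  # → 45
```

Under these assumptions each core can hold three 45-by-45 single-precision tiles, which shows how quickly code size eats into the space available for data.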
Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems
The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and flexibility thanks to their programmability, but increases the design complexity. The design flow has to comprise several steps to bridge the gap between the starting solution, which is usually a reference sequential implementation, and the final heterogeneous solution, which includes custom hardware accelerators. These steps include analyzing the application to identify the functionalities that benefit from execution in hardware, and generating their implementations by means of Hardware Description Languages. Generating these descriptions can be very difficult for a software developer because software programs and hardware descriptions follow different programming paradigms. To assist the developer in this activity, High Level Synthesis techniques have been developed that aim to (semi-)automatically generate hardware implementations of specifications written in high-level languages (e.g., C). Compared with other embedded systems scenarios, aerospace systems introduce further constraints that have to be taken into account when designing these heterogeneous systems. In this type of system, explicit data transfers to and from FPGAs are preferred to a shared memory architecture. Explicit transfers potentially improve the predictability of the produced solutions, but the sizes of all the data transferred to and from any device must be known at design time. Identifying these sizes in complex C applications that use pointers can be difficult. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented.
The initial reference application is analyzed to identify the sizes of the data exchanged among the different components of the application. Next, starting from the high-level specification and the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.
Techniques for Improving the Ease of Programming in OpenCL
Doctoral thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2016. Jaejin Lee (이재진).
OpenCL is one of the major programming models for heterogeneous systems. This thesis identifies two limitations of OpenCL, the complexity of programming in OpenCL and the lack of support for heterogeneous clusters, and proposes a solution to each of them to make programming easier.
The first limitation is that writing a program in OpenCL is complicated. To lower this programming complexity, this thesis proposes a framework that translates a program written in a high-level language (OpenMP) to OpenCL at the source level. The framework achieves both ease of programming and high performance by employing two techniques: data transfer minimization (DTM) and performance portability enhancement (PPE). This thesis shows the effectiveness of the proposed translation framework by evaluating benchmark applications, and its practicality by comparing it with the commercial PGI compiler.
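The general idea behind minimizing host-device data transfers can be sketched as follows. This is a generic validity-tracking illustration and not the thesis's DTM algorithm. The class `TrackedBuffer` and its methods are hypothetical, and a transfer is modeled simply as a counter increment.

```python
class TrackedBuffer:
    """Records which side (host or device) holds an up-to-date copy."""

    def __init__(self, name):
        self.name = name
        self.host_valid = True     # data starts out on the host
        self.device_valid = False
        self.transfers = 0         # copies actually performed

    def use_on_device(self, writes=False):
        # Copy host -> device only when the device copy is stale.
        if not self.device_valid:
            self.transfers += 1
            self.device_valid = True
        if writes:
            self.host_valid = False  # host copy is now out of date

    def use_on_host(self, writes=False):
        # Copy device -> host only when the host copy is stale.
        if not self.host_valid:
            self.transfers += 1
            self.host_valid = True
        if writes:
            self.device_valid = False


buf = TrackedBuffer("a")
buf.use_on_device(writes=True)  # first kernel: one host-to-device copy
buf.use_on_device()             # second kernel: device copy reused
buf.use_on_device()             # third kernel: device copy reused
buf.use_on_host()               # result read back: one device-to-host copy
print(buf.transfers)            # → 2
```

A naive translation that copies the buffer in and out around each of the three kernels would perform six transfers for the same sequence.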
The second limitation of OpenCL is the lack of support for heterogeneous clusters. To extend OpenCL to a heterogeneous cluster, this thesis proposes a framework called SnuCL-D that can execute a program written only in OpenCL on a heterogeneous cluster. Unlike previous approaches, which are centralized, the proposed framework takes a decentralized approach, which makes it possible to reduce three kinds of overhead that occur in the execution path of commands.
Because it can analyze and reduce these three kinds of overhead, the proposed framework scales well on a large-scale cluster system. Its effectiveness and practicality are demonstrated by comparing it with the representative centralized approach (SnuCL) and with MPI, using benchmark applications.
This thesis proposes solutions for the two limitations of OpenCL for ease of programming on heterogeneous clusters. It is expected that application developers will be able to easily execute not only an OpenMP program on various accelerators but also a program written only in OpenCL on a heterogeneous cluster.
Chapter I. Introduction
I.1 Motivation and Objectives
I.1.1 Programming Complexity
I.1.2 Lack of Support for a Heterogeneous Cluster
I.2 Contributions
Chapter II. Background and Related Work
II.1 Background
II.1.1 OpenCL
II.1.2 OpenMP
II.2 Related Work
II.2.1 Programming Complexity
II.2.2 Support for a Heterogeneous Cluster
Chapter III. Lowering the Programming Complexity
III.1 Motivating Example
III.1.1 Device Constructs
III.1.2 Needs for Data Transfer Optimization
III.2 Mapping OpenMP to OpenCL
III.2.1 Architecture Model
III.2.2 Execution Model
III.3 Code Translation
III.3.1 Translation Process
III.3.2 Translating OpenMP to OpenCL
III.3.3 Example of Code Translation
III.3.4 Data Transfer Minimization (DTM)
III.3.5 Performance Portability Enhancement (PPE)
III.4 Performance Evaluation
III.4.1 Evaluation Methodology
III.4.2 Effectiveness of Optimization Techniques
III.4.3 Comparison with Other Implementations
Chapter IV. Support for a Heterogeneous Cluster
IV.1 Problems of Previous Approaches
IV.2 The Approach of SnuCL-D
IV.2.1 Overhead Analysis
IV.2.2 Remote Device Virtualization
IV.2.3 Redundant Computation and Data Replication
IV.2.4 Memory-read Commands
IV.3 Consistency Management
IV.4 Deterministic Command Scheduling
IV.5 New API Function: clAttachBufferToDevice()
IV.6 Queueing Optimization
IV.7 Performance Evaluation
IV.7.1 Evaluation Methodology
IV.7.2 Evaluation with a Microbenchmark
IV.7.3 Evaluation on the Large-scale CPU Cluster
IV.7.4 Evaluation on the Medium-scale GPU Cluster
Chapter V. Conclusion and Future Work
Bibliography
Korean Abstract