16 research outputs found

    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP's best features, its flexibility: typical OpenMP applications can run on any number of CPUs, but GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup on four GPUs over using only CPUs or a single GPU, while increasing the overall flexibility of Accelerated OpenMP.
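    The abstract describes extending Accelerated OpenMP so that a single loop can be co-scheduled across CPUs and multiple GPUs. As a rough illustration only (the abstract does not give the extension's syntax), the C sketch below shows a standard OpenMP 4.x target loop, which today binds to a single device, with a comment marking where a hypothetical hetero(...) scheduling clause of the kind CoreTSAR proposes would go.

        #include <stdio.h>

        #define N 1000000

        int main(void) {
            static float a[N], b[N], c[N];
            for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

            /* Today this loop runs entirely on one accelerator (or the host).
               A CoreTSAR-style extension would add a clause, e.g. a hypothetical
               hetero(adaptive), to split the iterations across CPUs and GPUs. */
            #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];

            printf("c[42] = %f\n", c[42]);
            return 0;
        }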

    Heterogeneous Task Scheduling for Accelerated OpenMP

    Abstract not provided.

    Accelerating Face Anti-Spoofing Algorithms

    This thesis focuses on accelerating an algorithm from the field of face-based anti-spoofing, using graphics hardware as a platform for data-parallel processing. OpenCL is used as the framework: it allows execution on anything from powerful desktop computers to hand-held devices, and on different kinds of processing units such as GPUs, ASICs, or x86 CPUs, without being tied to a particular hardware vendor or operating system. The author presents an analysis and an accelerated implementation of a widely used algorithm, together with the impact of the acceleration on execution time.
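    Since the abstract leans on OpenCL's ability to target GPUs, CPUs, and other accelerators without vendor lock-in, the following minimal host-side C sketch (not taken from the thesis) shows the usual pattern: prefer a GPU device and fall back to a CPU device when none is available. Kernel code and most error handling are omitted.

        #include <stdio.h>
        #include <CL/cl.h>

        int main(void) {
            cl_platform_id platform;
            cl_device_id device;
            cl_int err;

            clGetPlatformIDs(1, &platform, NULL);

            /* Prefer a GPU, but the same host code runs on a CPU-only machine. */
            err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
            if (err != CL_SUCCESS)
                err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
            if (err != CL_SUCCESS) { fprintf(stderr, "no OpenCL device found\n"); return 1; }

            cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
            /* ... build program, create kernels, enqueue work ... */
            clReleaseContext(ctx);
            return 0;
        }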

    Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor

    In the construction of exascale computing systems, energy efficiency and power consumption are two of the major challenges. Low-power, high-performance embedded systems are of increasing interest as building blocks for large-scale high-performance systems. However, extracting maximum performance from such systems presents many challenges: various aspects, from the hardware architecture to the programming models used, need to be explored. The Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of processing efficiency. However, with just 32 KB of memory per eCore for storing both data and code, and only low-level inter-core communication support, programming the Epiphany system presents several challenges. In this paper we evaluate the performance of the Epiphany system for a variety of basic compute and communication operations. Guided by this data, we explore strategies for implementing scientific applications on memory-constrained, low-powered devices such as the Epiphany. With future systems expected to house thousands of cores on a single chip, the merits of such architectures as a path to exascale are compared to those of other competing systems.
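    The 32 KB per-eCore local store mentioned above forces data to be streamed through in small tiles. The C sketch below illustrates that tiling pattern in a generic way; the copy_in() helper is a placeholder for whatever DMA or memcpy mechanism the platform provides and is not the actual Epiphany e-lib API.

        #include <string.h>

        #define LOCAL_BYTES (16 * 1024)              /* leave room for code and stack */
        #define TILE (LOCAL_BYTES / sizeof(float))   /* floats per tile */

        static float local_buf[TILE];                /* would live in fast local memory */

        static void copy_in(float *dst, const float *src, size_t n) {
            memcpy(dst, src, n * sizeof(float));     /* placeholder for a DMA transfer */
        }

        float sum_large_array(const float *ext, size_t n) {
            float acc = 0.0f;
            for (size_t off = 0; off < n; off += TILE) {
                size_t chunk = (n - off < TILE) ? (n - off) : TILE;
                copy_in(local_buf, ext + off, chunk);
                for (size_t i = 0; i < chunk; i++)
                    acc += local_buf[i];
            }
            return acc;
        }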

    Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems

    The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and flexibility thanks to their programmability, but increases the design complexity. Design flows therefore have to be composed of several steps to fill the gap between the starting solution, usually a reference sequential implementation, and the final heterogeneous solution, which includes custom hardware accelerators. These steps include analyzing the application to identify the functionalities that benefit from execution in hardware and generating their implementations in Hardware Description Languages. Producing these descriptions can be very difficult for a software developer because of the different programming paradigms of software programs and hardware descriptions. To support the developer in this activity, High Level Synthesis techniques have been developed, aiming at (semi-)automatically generating hardware implementations of specifications written in high-level languages (e.g., C). With respect to other embedded-systems scenarios, aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of system, explicit data transfers to and from FPGAs are preferred to the adoption of a shared memory architecture: the explicit approach potentially improves the predictability of the produced solutions, but the sizes of all the data transferred to and from any device must be known at design time. Identifying these sizes for complex C applications that use pointers is not straightforward. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify the sizes of the data exchanged among the different components of the application. Then, starting from the high-level specification and the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.
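    The point about transfer sizes having to be known at design time can be made concrete with a small, purely illustrative C contrast: a bare-pointer interface hides how much data a hardware accelerator would have to move, while a fixed-size interface exposes it to analysis and High Level Synthesis. The names and the kernel body are invented for the example and are not from the paper's tool flow.

        /* Hard to analyze: how many bytes does `in` point to? */
        void filter_ptr(const float *in, float *out, int n);

        /* Transfer sizes are compile-time constants, so the amount of data
           moved to and from the FPGA is known at design time. */
        #define FRAME 1024
        void filter_fixed(const float in[FRAME], float out[FRAME]) {
            for (int i = 0; i < FRAME; i++)
                out[i] = 0.5f * in[i];   /* stands in for the real computation */
        }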

    Techniques for Improving the Ease of Programming in OpenCL

    Ph.D. dissertation, Department of Electrical and Computer Engineering, Seoul National University, February 2016 (advisor: Jaejin Lee). OpenCL is one of the major programming models for heterogeneous systems. This thesis identifies two limitations of OpenCL, the complicated nature of programming in OpenCL and the lack of support for heterogeneous clusters, and proposes a solution for each of them for ease of programming. The first limitation is that writing a program in OpenCL is complicated. To lower this programming complexity, this thesis proposes a framework that translates a program written in a high-level language (OpenMP) to OpenCL at the source level. It achieves both ease of programming and high performance by employing two techniques: data transfer minimization (DTM) and performance portability enhancement (PPE). The effectiveness of the proposed translation framework is shown by evaluating benchmark applications, and its practicality by comparing it with the commercial PGI compiler. The second limitation of OpenCL is the lack of support for a heterogeneous cluster. To extend OpenCL to a heterogeneous cluster, this thesis proposes a framework called SnuCL-D that can execute a program written only in OpenCL on a heterogeneous cluster. Unlike previous approaches, which are centralized, the proposed framework applies a decentralized approach, which gives it the opportunity to reduce three kinds of overhead occurring in the execution path of commands. By analyzing and reducing these overheads, the proposed framework shows good scalability on a large-scale cluster system. Its effectiveness and practicality are demonstrated by comparison with the representative centralized approach (SnuCL) and with MPI on benchmark applications. In summary, this thesis proposes solutions for two limitations of OpenCL that ease programming on heterogeneous clusters: application developers will be able to easily execute not only an OpenMP program on various accelerators but also a program written only in OpenCL on a heterogeneous cluster.
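    The data transfer minimization (DTM) technique described above amounts to keeping data resident on the device across consecutive kernels instead of copying it back and forth. The C/OpenMP sketch below shows the kind of input pattern that benefits; it is an illustration of the idea, not output of the thesis' translation framework.

        #define N 4096
        float a[N], b[N];

        void two_kernels(void) {
            /* The target data region keeps `a` on the device for both kernels,
               so no intermediate host<->device copy of `a` is needed. */
            #pragma omp target data map(tofrom: a) map(from: b)
            {
                #pragma omp target teams distribute parallel for
                for (int i = 0; i < N; i++) a[i] *= 2.0f;          /* kernel 1 */

                #pragma omp target teams distribute parallel for
                for (int i = 0; i < N; i++) b[i] = a[i] + 1.0f;    /* kernel 2 reuses a */
            }
        }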