12 research outputs found

    Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

    Get PDF
    © 2017 IEEE. Heterogeneous systems have very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that meets two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well-known Static and Dynamic algorithms. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system. This work has been supported by the University of Cantabria with grant CVE-2014-18166; the Generalitat de Catalunya under grant 2014-SGR-1051; the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P; the Spanish Government through the Programa Severo Ochoa (SEV-2015-0493); the European Research Council under grant agreement No. 321253; the European Community's Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects, grant agreements no. 288777, 610402 and 671697; and the European HiPEAC Network. Peer reviewed. Postprint (published version).
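    The abstract does not give the HGuided formula, so the following is only our illustration of a guided-style division: chunk sizes are proportional to each device's relative speed and shrink as the remaining work decreases, much like OpenMP's guided schedule. The function name, the round-robin device order, and the `k` divisor are all assumptions.

```python
def hguided_chunks(total_items, speeds, min_chunk=1, k=2):
    """Yield (device, offset, size) assignments for total_items work-items.

    speeds: relative computing power of each device (arbitrary units).
    k: divisor controlling how aggressively chunks shrink.
    """
    total_speed = sum(speeds)
    remaining = total_items
    offset = 0
    assignments = []
    dev = 0
    while remaining > 0:
        # Chunk for this device: its speed-weighted share of the remaining work.
        size = max(min_chunk, (remaining * speeds[dev]) // (k * total_speed))
        size = min(size, remaining)
        assignments.append((dev, offset, size))
        offset += size
        remaining -= size
        dev = (dev + 1) % len(speeds)  # round-robin stands in for "next idle device"
    return assignments
```

    In a real runtime each chunk would become one kernel launch over a sub-range of the global index space; here the schedule is computed up front only to show the shrinking-chunk shape.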

    High Performance Code Generation for Stencil Computation on Heterogeneous Multi-device Architectures

    Get PDF
    Heterogeneous architectures have been widely used in the domain of high performance computing. On one hand, they allow a designer to combine multiple types of computing units, each executing the tasks it is best suited for, to increase performance; on the other hand, they bring many challenges for novice programmers, especially on heterogeneous systems with multiple devices. In this paper, we propose the code generator STEPOCL, which generates the OpenCL host program for heterogeneous multi-device architectures. To simplify the analysis, the user provides a description of the input and kernel parameters in an XML file; our generator then analyzes this description and automatically generates the host program. Thanks to its data partition and data exchange strategies, the generated host program can be executed on multiple devices without changing any kernel code. An experiment on iterative stencil loop (ISL) code shows that our tool is efficient: it guarantees minimal data exchanges and achieves high performance on heterogeneous multi-device architectures.
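    As a rough illustration of the data-partition problem such a generator must solve for a stencil, the sketch below splits the rows of a 2-D grid among devices and records the halo rows each partition must exchange with its neighbours every iteration. The function and its interface are hypothetical, not STEPOCL's.

```python
def partition_rows(n_rows, n_devices, radius):
    """Return per-device (start, end, halo_lo, halo_hi) row ranges.

    radius: stencil radius, i.e. how many neighbour rows each point reads.
    Only the halo rows need to be exchanged between iterations.
    """
    base, extra = divmod(n_rows, n_devices)
    parts, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        end = start + size
        halo_lo = radius if d > 0 else 0              # rows needed from the previous device
        halo_hi = radius if d < n_devices - 1 else 0  # rows needed from the next device
        parts.append((start, end, halo_lo, halo_hi))
        start = end
    return parts
```

    Keeping the exchange limited to the halo rows is what "minimum data exchanges" amounts to in this simplified model.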

    Automatic OpenCL Task Adaptation for Heterogeneous Architectures

    Get PDF
    OpenCL defines a common parallel programming language for all devices, but writing tasks adapted to each device and managing communication and load-balancing issues are left to the programmer. In this work, we propose a novel automatic compiler and runtime technique to execute single OpenCL kernels on heterogeneous multi-device architectures. The proposed technique is completely transparent to the user and requires neither off-line training nor a performance model. It handles the communication and load-balancing issues that result from hardware heterogeneity, from load imbalance within the kernel itself, and from load variations between repeated executions of the kernel in an iterative computation. We present our results on benchmarks and on an N-body application over two platforms: a 12-core CPU with two different GPUs, and a 16-core CPU with three homogeneous GPUs.
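    The adaptive idea described above can be sketched simply: between repeated executions of the kernel, each device's share of the index space is recomputed from the execution times just measured, so the split converges without off-line training or a performance model. The function below is our simplified illustration, not the paper's runtime.

```python
def rebalance(shares, times):
    """Update per-device work shares so faster devices get more work.

    shares: current fraction of the index space per device (sums to 1).
    times:  measured execution time of the last run per device.
    """
    # Observed throughput = work done / time taken.
    throughput = [s / t for s, t in zip(shares, times)]
    total = sum(throughput)
    return [tp / total for tp in throughput]
```

    For example, if two devices each get half the work but one takes twice as long, the new shares become one third and two thirds; if the relative speeds are stable, the measured times equalize after one step.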

    Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures

    Get PDF
    Heterogeneous architectures are being used extensively to improve system processing capabilities. Critical functions of each application (kernels) can be mapped to different computing devices (i.e. CPUs, GPGPUs, accelerators) to maximize performance. However, best performance can only be achieved if kernels are accurately mapped to the right device. Moreover, in some cases those kernels could be split and executed over several devices at the same time to maximize the use of compute resources on heterogeneous parallel architectures. In this paper, we define a static partitioning model based on profiling information from previous executions. This model follows a quantitative approach that computes the optimal match according to user-defined constraints. We test different scenarios to evaluate our model: single-kernel and multi-kernel applications. Experimental results show that our static partitioning model can increase the performance of parallel applications by deploying not only different kernels over different devices but also a single kernel over multiple devices. This avoids idle compute resources on heterogeneous platforms and enhances the overall performance. © 2015 Elsevier B.V. All rights reserved. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement n. 609666 [24]
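    A minimal sketch of partitioning from profiling data (our illustration, not the paper's exact quantitative model): each kernel is mapped to its fastest profiled device, and a single kernel can additionally be split across devices in inverse proportion to its profiled times so that all devices finish together.

```python
def map_kernels(profile):
    """Whole-kernel mapping: profile is {kernel: {device: measured_time}};
    returns {kernel: fastest_device}."""
    return {k: min(times, key=times.get) for k, times in profile.items()}

def split_kernel(times):
    """Split one kernel's work across devices so all finish together:
    share_d is proportional to 1 / time_d."""
    inv = {d: 1.0 / t for d, t in times.items()}
    total = sum(inv.values())
    return {d: v / total for d, v in inv.items()}
```

    With a profile where the GPU runs a kernel four times faster than the CPU, the split gives the GPU 80% of the work and the CPU 20%, which is the "single kernel over multiple devices" case the abstract mentions.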

    SOCL: An OpenCL Implementation with Automatic Multi-Device Adaptation Support

    Get PDF
    To fully tap into the potential of today's heterogeneous machines, offloading parts of an application onto accelerators is not sufficient. The real challenge is to build systems where the application is permanently spread across the entire machine, that is, where parallel tasks are dynamically scheduled over the full set of available processing units. In this report we present SOCL, an OpenCL implementation that improves and simplifies the programming experience on heterogeneous architectures. SOCL enables applications to dynamically dispatch computation kernels over processing devices so as to maximize their utilization. Existing OpenCL applications can incrementally adopt light, non-intrusive extensions, requiring very few code changes, to automatically schedule kernels in a controlled manner on multi-device architectures. A preliminary automatic granularity adaptation extension is also provided. We demonstrate the relevance of our approach by experimenting with several OpenCL applications on a range of representative heterogeneous architectures, and show that performance portability is enhanced by using the SOCL extensions.
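    The dispatch goal SOCL describes, maximizing device utilization, can be illustrated with a toy greedy scheduler that assigns each ready kernel to the device with the earliest predicted finish time. The cost model and all names below are our assumptions, not SOCL's internals.

```python
def schedule(kernels, device_speeds):
    """Greedy earliest-finish-time dispatch.

    kernels: list of (name, cost) in arbitrary work units.
    device_speeds: work units per second for each device.
    Returns {kernel_name: device_index}.
    """
    free = [0.0] * len(device_speeds)  # time at which each device becomes idle
    placement = {}
    for name, cost in kernels:
        # Predicted finish time of this kernel on every device.
        finish = [free[d] + cost / device_speeds[d] for d in range(len(free))]
        d = finish.index(min(finish))
        placement[name] = d
        free[d] = finish[d]
    return placement
```

    Note how a fast GPU can absorb several kernels in a row while a slow CPU stays idle, yet the CPU is still used once the GPU's queue grows long enough, which is the utilization-maximizing behaviour the abstract describes.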

    Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

    Get PDF
    A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided and effortless, as it does not require the user to supply any parameter, adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, always reducing programming effort compared to OpenCL, and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
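    The abstract gives no formulas, so the following is only a loose illustration of what a sigmoid-shaped schedule could look like: chunks start large to keep scheduling overhead low and shrink smoothly near the end for fine-grained balance. Every parameter and name here is an assumption, not Sigmoid's actual algorithm.

```python
import math

def sigmoid_chunk(total, done, max_chunk, min_chunk, steepness=10.0):
    """Chunk size after `done` of `total` work-items are completed.

    A logistic curve decays the size from ~max_chunk at the start
    to ~min_chunk at the end; the final chunk is clamped to what remains.
    """
    progress = done / total
    scale = 1.0 / (1.0 + math.exp(steepness * (progress - 0.5)))
    size = int(min_chunk + (max_chunk - min_chunk) * scale)
    return min(max(size, min_chunk), total - done)
```

    A runtime using such a schedule would hand each idle device the next chunk, so faster devices naturally pick up more chunks, giving the proportional split the abstract describes.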

    An OpenCL Framework for Heterogeneous Clusters

    Get PDF
    Doctoral dissertation, Seoul National University, Department of Electrical Engineering and Computer Science, August 2013 (advisor: Jaejin Lee). OpenCL is a unified programming model for different types of computational units in a single heterogeneous computing system. OpenCL provides a common hardware abstraction layer across different computational units. Programmers can write OpenCL applications once and run them on any OpenCL-compliant hardware. However, one of the limitations of current OpenCL is that it is restricted to a programming model on a single operating system image. It does not work for a cluster of multiple nodes unless the programmer explicitly uses communication libraries, such as MPI. A heterogeneous cluster contains multiple general-purpose multicore CPUs and multiple accelerators to solve bigger problems within an acceptable time frame. As such clusters widen their user base, application developers are being forced to turn to an unattractive mix of programming models, such as MPI-OpenCL. This makes the application more complex, harder to maintain, and less portable. In this thesis, we propose SnuCL, an OpenCL framework for heterogeneous clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves both high performance and ease of programming. SnuCL provides the user with a single system image running one operating system instance for the heterogeneous cluster. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
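    The single-system-image illusion described above can be sketched as a mapping layer: compute devices scattered across cluster nodes are exposed to the host program as one flat list, so the application indexes devices without knowing which node owns them. The code below is our illustration of that idea, not SnuCL's implementation.

```python
def flatten_devices(cluster):
    """Expose per-node device lists as one flat, host-visible device list.

    cluster: {node_name: [device_name, ...]}
    Returns (flat_list, index_to_location) where index_to_location maps each
    flat index back to its (node, device) pair for command routing.
    """
    flat, where = [], {}
    for node in sorted(cluster):          # deterministic device ordering
        for dev in cluster[node]:
            where[len(flat)] = (node, dev)
            flat.append(dev)
    return flat, where
```

    A runtime holding such a table can accept ordinary OpenCL commands against flat device indices and route them to the owning node internally, which is why no MPI calls are needed in the application source.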

    Virtualizing Data Parallel Systems for Portability, Productivity, and Performance.

    Full text link
    Computer systems equipped with graphics processing units (GPUs) have become increasingly common over the last decade. To utilize the highly data-parallel architecture of GPUs for general-purpose applications, new programming models such as OpenCL and CUDA were introduced, showing that data-parallel kernels on GPUs can achieve speedups of several orders of magnitude. With this success, applications from a variety of domains have been converted to use several complicated OpenCL/CUDA data-parallel kernels to benefit from data-parallel systems. Simultaneously, the software industry has experienced massive growth in the amount of data to process, demanding more powerful workhorses for data-parallel computation. Consequently, additional parallel computing devices such as extra GPUs and co-processors are attached to the system in the expectation of more performance and the capability to process larger data. However, these programming models expose hardware details to programmers, such as the number of computing devices, the interconnects, and the physical memory size of each device. This degrades productivity in the software development process, as programmers must manually split the workload with regard to hardware characteristics. This process is tedious and error-prone, and, most importantly, it is hard to maximize performance at compile time because programmers do not know the runtime behaviors that affect performance, such as input size and device availability. Therefore, applications lack portability, as they may fail to run due to limited physical memory or experience suboptimal performance across different systems. To cope with these challenges, this thesis proposes a dynamic compiler framework that provides OpenCL applications with an abstraction layer for physical devices. This abstraction layer virtualizes physical devices and memory sub-systems, and transparently orchestrates the execution of multiple data-parallel kernels on multiple computing devices.
    The framework significantly improves productivity as it provides hardware portability, allowing programmers to write an OpenCL program without being concerned about the target devices. Our framework also maximizes performance by balancing the data-parallel workload, considering factors such as kernel dependencies, device performance variation on workloads of different sizes, the data transfer cost over the interconnect between devices, and the physical memory limit of each device. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113361/1/jhaeng_1.pd
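    The balancing decision described here, weighing compute speed against data-transfer cost, can be illustrated with a toy two-device model (our construction, not the thesis' algorithm): choose the GPU's share of the work so that transfer-plus-compute time on the GPU matches CPU time on the remainder.

```python
def best_gpu_fraction(cpu_rate, gpu_rate, transfer_rate, steps=1000):
    """Fraction of work to offload to the GPU, in a toy two-device model.

    Rates are in work-items per second; transfer cost is proportional to
    the GPU's share. Minimizes the makespan max(CPU time, GPU time) by a
    simple grid search over fractions in [0, 1].
    """
    def makespan(f):
        gpu_time = f / transfer_rate + f / gpu_rate  # ship data, then compute
        cpu_time = (1 - f) / cpu_rate                # CPU keeps the rest
        return max(gpu_time, cpu_time)
    return min((i / steps for i in range(steps + 1)), key=makespan)
```

    With a GPU four times faster than the CPU but paying a transfer cost equal to its compute cost, the optimum is to offload about two thirds of the work, not four fifths, showing why interconnect cost must enter the split decision.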