18 research outputs found

    A Domain Specific Approach to High Performance Heterogeneous Computing

    Users of heterogeneous computing systems face two problems: first, understanding the trade-offs between the observable characteristics of their applications, such as latency and quality of the result; and second, exploiting knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain-specific approach addresses both problems. By considering a subset of operations or functions, models of the observable characteristics, or domain metrics, may be formulated in advance and populated at run time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program, which can be solved using heuristics, machine learning or Mixed Integer Linear Programming (MILP) frameworks. These claims are illustrated using the example domain of derivatives pricing in computational finance, with the domain metrics of workload latency (makespan) and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running on a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both makespan and accuracy are generally within 10% of the run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, latency improvements of up to 24 and 270 times, respectively, over the heuristic approach are seen. Comment: 14 pages, preprint draft, minor revisions
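    The allocation step described above can be pictured as a tiny constrained search. In the sketch below the platform latencies, accuracy bounds and task names are illustrative placeholders, not values from the paper, and an exhaustive search stands in for a MILP solver at this toy scale:

```python
from itertools import product

# Hypothetical per-task latency (seconds) and accuracy (pricing error bound)
# for three platform classes -- placeholders, not the paper's measured models.
latency = {"cpu": 4.0, "gpu": 1.5, "fpga": 0.9}
accuracy = {"cpu": 1e-7, "gpu": 1e-6, "fpga": 1e-5}

tasks = ["heston_0", "heston_1", "blackscholes_0"]  # illustrative task names
max_error = 1e-5  # user-supplied accuracy constraint

def best_allocation(tasks, latency, accuracy, max_error):
    """Exhaustively search task->platform assignments, minimising the
    makespan (largest per-platform summed latency) subject to the accuracy
    constraint.  A MILP solver replaces this search at realistic sizes."""
    platforms = list(latency)
    best, best_cost = None, float("inf")
    for assignment in product(platforms, repeat=len(tasks)):
        if any(accuracy[p] > max_error for p in assignment):
            continue  # violates the accuracy constraint
        load = dict.fromkeys(platforms, 0.0)
        for task, p in zip(tasks, assignment):
            load[p] += latency[p]
        cost = max(load.values())
        if cost < best_cost:
            best, best_cost = dict(zip(tasks, assignment)), cost
    return best, best_cost
```

    On this toy instance the minimum-makespan assignment places two tasks on the FPGA and one on the GPU, balancing the per-platform loads.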

    Unified Multi-Accelerator Programming in OpenCL

    The OpenCL standard defines a programming interface that can be adapted to different types of accelerators (GPUs, CPUs, the Cell, etc.). For each architecture, however, applications must explicitly perform data partitioning and transfers as well as the placement of tasks on the available accelerators, which is very difficult. We show that the OpenCL standard can nevertheless be used with an implementation that hides the individual accelerators from the application and exposes only a single (virtual) one. Data transfers and task placement are then carried out by the implementation. We show that this programming model makes it possible to exploit heterogeneous architectures efficiently and in a unified way
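    One way to picture the single-virtual-device idea, as a rough sketch rather than this paper's actual implementation: the runtime behind the virtual device can greedily place each enqueued kernel on the real device that becomes free earliest. Device names and kernel costs below are hypothetical:

```python
import heapq

def virtual_device_schedule(task_costs, devices):
    """Sketch of a runtime behind a single virtual device: the application
    enqueues kernels on one logical device, and the runtime places each
    kernel on the real device with the earliest availability time."""
    # Min-heap of (time the device becomes free, device name).
    free_times = [(0.0, d) for d in devices]
    heapq.heapify(free_times)
    placement = {}
    for task, cost in task_costs.items():
        free_at, dev = heapq.heappop(free_times)
        placement[task] = dev
        heapq.heappush(free_times, (free_at + cost, dev))
    return placement
```

    A real implementation would also schedule the data transfers implied by each placement; this sketch only captures the placement decision.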

    Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures

    Heterogeneous architectures are being used extensively to improve system processing capabilities. Critical functions of each application (kernels) can be mapped to different computing devices (i.e., CPUs, GPGPUs, accelerators) to maximize performance. However, the best performance can only be achieved if kernels are accurately mapped to the right device. Moreover, in some cases those kernels could be split and executed over several devices at the same time to maximize the use of compute resources on heterogeneous parallel architectures. In this paper, we define a static partitioning model based on profiling information from previous executions. This model follows a quantitative approach which computes the optimal match according to user-defined constraints. We test different scenarios to evaluate our model: single-kernel and multi-kernel applications. Experimental results show that our static partitioning model can increase the performance of parallel applications by deploying not only different kernels over different devices but also a single kernel over multiple devices. This avoids idle compute resources on heterogeneous platforms and enhances overall performance. (C) 2015 Elsevier B.V. All rights reserved. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement n. 609666 [24]
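    A minimal sketch of splitting a single kernel across devices from profiling data, assuming the profile yields a sustained throughput per device (the figures in the usage below are illustrative, not the paper's model):

```python
def split_workitems(total, throughput):
    """Split `total` work-items across devices in proportion to profiled
    throughput (work-items per second), using largest-remainder rounding
    so the integer shares sum exactly to `total`."""
    combined = sum(throughput.values())
    exact = {d: total * t / combined for d, t in throughput.items()}
    shares = {d: int(v) for d, v in exact.items()}
    leftover = total - sum(shares.values())
    # Hand remaining work-items to the devices with the largest remainders.
    for d in sorted(exact, key=lambda d: exact[d] - shares[d], reverse=True):
        if leftover == 0:
            break
        shares[d] += 1
        leftover -= 1
    return shares
```

    For example, `split_workitems(1000, {"cpu": 1.0, "gpu": 3.0})` gives the GPU three quarters of the work-items. A full model would also weigh transfer costs and user-defined constraints, as the paper does.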

    Automatic OpenCL Task Adaptation for Heterogeneous Architectures

    OpenCL defines a common parallel programming language for all devices, but writing tasks adapted to each device, managing communication and handling load balancing are left to the programmer. In this work, we propose a novel automatic compiler and runtime technique to execute single OpenCL kernels on heterogeneous multi-device architectures. The proposed technique is completely transparent to the user and requires neither off-line training nor a performance model. It handles the communication and load-balancing issues that arise from hardware heterogeneity, from load imbalance within the kernel itself, and from load variations between repeated executions of the kernel in an iterative computation. We present our results on benchmarks and on an N-body application over two platforms: a 12-core CPU with two different GPUs, and a 16-core CPU with three homogeneous GPUs
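    The iterative load-balancing idea can be sketched as follows, under the assumption that device shares are adjusted between repetitions of the kernel using each device's measured time from the previous iteration. This is a simplification of the paper's runtime, with hypothetical numbers:

```python
def rebalance(shares, times, damping=0.5):
    """One step of iterative load balancing: shift work shares toward
    devices that finished faster.  Observed throughput is share/time, and
    the ideal share is proportional to it; `damping` limits oscillation.
    Shares are left as floats; rounding to work-items is omitted here."""
    speed = {d: shares[d] / times[d] for d in shares}
    total_share = sum(shares.values())
    total_speed = sum(speed.values())
    ideal = {d: total_share * speed[d] / total_speed for d in shares}
    return {d: shares[d] + damping * (ideal[d] - shares[d]) for d in shares}
```

    Repeating this step across iterations converges toward shares proportional to each device's actual speed, without any off-line training or performance model, which matches the spirit of the approach described above.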

    Unified Multi-Accelerator OpenCL Programming

    The OpenCL standard defines a programming interface based on task parallelism and supported by different types of compute units (GPUs, CPUs, the Cell, etc.). One characteristic of OpenCL is that the placement of tasks on the various compute units must be done manually. For a hybrid machine with, for example, a multicore CPU and one or more accelerators, this constraint makes load balancing across the units very hard to achieve. This is particularly the case for applications whose task granularity and task count vary during execution. It also follows that the scalability of an OpenCL application is limited in the context of a hybrid machine. In this paper we propose to overcome this limitation by creating a virtual, parallel compute unit that aggregates the machine's real units. OpenCL's manual placement targets this virtual unit, and responsibility for placement on the real units is left to a runtime system, which carries out the data transfers and the placement of tasks on the real units. We show that this solution greatly simplifies the programming of applications for hybrid architectures, and does so efficiently

    Higher-Order Convergent Finite Difference Methods for Equity Swaps and Heterogeneous Computing with OpenCL

    Doctoral thesis, Seoul National University Graduate School, Interdisciplinary Program in Computational Science, August 2013 (advisor: 신동우). A nine-point compact finite difference scheme with fourth-order convergence is proposed for an equity swap model. Because the equity swap model has coefficients that depend on both time and space, a special treatment is necessary to remove the mixed derivative term so that the resulting scheme is both compact and fourth-order convergent. A suitable coordinate transformation is proposed that successfully eliminates the mixed derivative term, and the resulting algorithm is shown to be a fourth-order convergent scheme; various examples confirm its validity. Since most linear solvers are built from basic linear algebra subroutines (BLAS), we optimize computational performance by distributing each subroutine between the CPU and GPU with a splitting ratio. We obtain this splitting ratio by solving a min-max problem on the computational times, which are estimated for both the CPU and the GPU as polynomial functions of their capabilities. The BLAS routines saxpy, sgemv, and sgemm are implemented in OpenCL, and we verify our min-max model against actual heterogeneous computing results.
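    The min-max splitting ratio can be found numerically once the CPU and GPU time models are known. The sketch below assumes the cost models are monotone (CPU time increasing in its fraction, GPU time increasing in its fraction), so the optimum is where the two times meet, and uses illustrative polynomials rather than the fitted models from the thesis:

```python
def minmax_split(t_cpu, t_gpu, lo=0.0, hi=1.0, iters=60):
    """Find the CPU fraction r minimising max(t_cpu(r), t_gpu(1 - r)) by
    bisection.  Assumes t_cpu and t_gpu are increasing in their arguments,
    so the makespan is minimised where the two finish times are equal."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if t_cpu(mid) < t_gpu(1 - mid):
            lo = mid  # CPU underloaded: give it a larger fraction
        else:
            hi = mid  # CPU overloaded: shrink its fraction
    return (lo + hi) / 2

# Illustrative linear cost models (setup cost + per-fraction term),
# standing in for the polynomial estimates fitted in the thesis.
r = minmax_split(lambda r: 10 * r + 1, lambda g: 2 * g + 0.5)
```

    For these placeholder models the balance point is at a CPU fraction of 0.125, where both devices finish at the same time.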