Search CORE

1,088 research outputs found

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

Author: Ahmed Fasih
Andreas Klöckner
Bell
Bryan Catanzaro
Buck
Chandler
Dalcín
Eich
Feldman
Flanagan
Frigo
Group
Hestenes
Hesthaven
Kennedy
Klöckner
Lam
Langtangen
Lindholm
McCarthy
McCool
Nicolas Pinto
Oliphant
Owens
Paul Ivanov
Pinto
Pinto
Prud’homme
Reynders
Seiler
Stein
Valiant
van Hateren
Veldhuizen
Wang
Whaley
Yunsup Lee
Publication venue: 'Elsevier BV'
Publication date: 29/03/2011
Field of study

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

arXiv.org e-Print Archive

Crossref

cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications

Author: Blum Troels
Kristensen Mads Ruben Burgdorff
Lund Simon Andreas Frimann
Vinter Brian
Publication venue
Publication date: 01/01/2012
Field of study

Modern processor architectures, in addition to having still more cores, also require still more consideration to memory-layout in order to run at full capacity. The usefulness of most languages is deprecating as their abstractions, structures or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that enables vector oriented high-level programming languages to map onto a broad range of architectures efficiently. The idea is to close the gap between high-level languages and hardware optimized low-level implementations. By translating high-level vector operations into an intermediate vector bytecode, cphVB enables specialized vector engines to efficiently execute the vector operations. The primary success parameters are to maintain a complete abstraction from low-level details and to provide efficient code execution across different, modern, processors. We evaluate the presented design through a setup that targets multi-core CPU architectures. We evaluate the performance of the implementation using Python implementations of well-known algorithms: a jacobi solver, a kNN search, a shallow water simulation and a synthetic stencil simulation. All demonstrate good performance

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

BioThreads: a novel VLIW-based chip multiprocessor for accelerating biomedical image processing applications

Author: Angelos S. Echiadis (7203827)
David Stevens (4254274)
Jia Zheng (140297)
Sijung Hu (1250727)
V Azorin-Peris (7013798)
Vassilios Chouliaras (1251600)
Publication venue
Publication date: 01/01/2011
Field of study

We discuss BioThreads, a novel, configurable, extensible system-on-chip multiprocessor and its use in accelerating biomedical signal processing applications such as imaging photoplethysmography (IPPG). BioThreads is derived from the LE1 open-source VLIW chip multiprocessor and efficiently handles instruction, data and thread-level parallelism. In addition, it supports a novel mechanism for the dynamic creation, and allocation of software threads to uncommitted processor cores by implementing key POSIX Threads primitives directly in hardware, as custom instructions. In this study, the BioThreads core is used to accelerate the calculation of the oxygen saturation map of living tissue in an experimental setup consisting of a high speed image acquisition system, connected to an FPGA board and to a host system. Results demonstrate near-linear acceleration of the core kernels of the target blood perfusion assessment with increasing number of hardware threads. The BioThreads processor was implemented on both standard-cell and FPGA technologies; in the first case and for an issue width of two, full real-time performance is achieved with 4 cores whereas on a mid-range Xilinx Virtex6 device this is achieved with 10 dual-issue cores. An 8-core LE1 VLIW FPGA prototype of the system achieved 240 times faster execution time than the scalar Microblaze processor demonstrating the scalability of the proposed solution to a state-of-the-art FPGA vendor provided soft CPU core

Loughborough University Institutional Repository

Multitarget Compilation Techniques for Generating Efficient OpenCL Code from Matrix-Oriented Computations

Author: Luis Alexandre Cubal dos Reis
Publication venue
Publication date: 27/01/2020
Field of study

Repositório Aberto da Universidade do Porto

PIPS Is not (just) Polyhedral Software Adding GPU Code Generation in PIPS

Author: Amini Mehdi
Ancourt Corinne
Coelho Fabien
Creusillet Béatrice
Guelton Serge
Irigoin François
Jouvelot Pierre
Keryell Ronan
Villalon Pierre
Publication venue: HAL CCSD
Publication date: 03/04/2011
Field of study

6 pagesInternational audienceParallel and heterogeneous computing are growing in audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OPENCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to automatically generate hardware-level codes from high-level sequential sources. Polyhedral models are becoming more popular because of their combination of expressiveness, compactness, and accurate abstraction of the data-parallel behaviour of programs. These models provide automatic or semi-automatic parallelization and code transformation capabilities that target such modern parallel architectures. PIPS is a quarter-century old source-to-source transformation framework that initially targeted parallel machines but then evolved to include other targets. PIPS uses abstract interpretation on an integer polyhedral lattice to represent program code, allowing linear relation analysis on integer variables in an interprocedural way. The same representation is used for the dependence test and the convex array region analysis. The polyhedral model is also more classically used to schedule code from linear constraints. In this paper, we illustrate the features of this compiler infrastructure on an hypothetical input code, demonstrating the combination of polyhedral and non polyhedral transformations. PIPS interprocedural polyhedral analyses are used to generate data transfers and are combined with non-polyhedral transformations to achieve efficient CUDA code generation

HAL-MINES ParisTech

A Model-based Design Framework for Application-specific Heterogeneous Systems

Author: Skalicky Samuel
Publication venue: RIT Scholar Works
Publication date: 01/05/2015
Field of study

The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications and implement them in systems ranging from low power embedded systems-on-chip (SoC) to high performance systems consisting of commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development ultimately leading to higher performing efficient systems

RIT Scholar Works

Towards Parallel Execution of Sequential Scientific Applications

Author: Kristensen Mads Ruben Burgdorff
Publication venue
Publication date: 06/09/2012
Field of study

Copenhagen University Research Information System

LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

Author: Alvarez Carlos
Bautista Leonardo
Becker Tobias
Billung-Meyer Gunnar
Carpenter Paul
Christmann Wolfgang
Cristal Adrian
De La Cruz Raul
Dubhashi Devdatt
Etsion Yoav
Felber Pascal
Fetzer Christof
Gaydadjiev Georgi
Göttel Christian
Hadar Elad
Hagemeyer Jens
Jimenez Daniel
Jungeblut Thorsten
Kaiser Martin
Klawonn Frank
Krupop Stefan
Kucza Nils
Madonar Sergi
Martorell Xavier
Mihklafi Amani
Mudge Trevor
Mudge Trevor
Pasin Marcelo
Pericàs Miquel
Pnevmatikatos Dionisios N.
Porrmann Mario
Port Oron
Rocha Isabelly
Salami Behzad
Salomonsson Hans
Schiavoni Valerio
Trancoso Pedro
Unsal Osman S.
vor dem Berge Micha
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

Chalmers Research

Publications at Bielefeld University