
    CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures

    The use of graphics processing units (GPUs) in high-performance parallel computing continues to become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and runtime system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a “write once, run anywhere” ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is typically straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that features a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA runtime API, and we have successfully translated many applications from the CUDA SDK and the Rodinia benchmark suite. The performance of applications automatically translated by CU2CL is on par with that of their manually ported counterparts.
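    The flavour of such a translation can be sketched. Below is a minimal, hypothetical illustration (not CU2CL's actual output) of how common CUDA runtime calls map onto the OpenCL host API, with the CUDA originals kept in comments; the identifiers (ctx, queue, d_buf) are illustrative only.

```cpp
// Hypothetical sketch of the kind of mapping a CUDA-to-OpenCL source-to-source
// translator performs on host-side allocation and copy calls. The CUDA
// originals are shown in comments; this is not CU2CL's generated code.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    // Minimal OpenCL setup that the CUDA runtime normally hides.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    const size_t n = 1024, bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    // CUDA:  float *d_buf; cudaMalloc(&d_buf, bytes);
    cl_mem d_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);

    // CUDA:  cudaMemcpy(d_buf, host.data(), bytes, cudaMemcpyHostToDevice);
    clEnqueueWriteBuffer(queue, d_buf, CL_TRUE, 0, bytes, host.data(),
                         0, nullptr, nullptr);

    // CUDA:  cudaMemcpy(host.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    clEnqueueReadBuffer(queue, d_buf, CL_TRUE, 0, bytes, host.data(),
                        0, nullptr, nullptr);

    // CUDA:  cudaFree(d_buf);
    clReleaseMemObject(d_buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    printf("copied %zu bytes to and from the device\n", bytes);
    return 0;
}
```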

    Scalable framework for heterogeneous clustering of commodity FPGAs

    A combination of parallelism exploitation and application specific hardware is increasingly being used to address the computational requirements of a diverse and extensive set of application areas. These targeted applications have specific computational requirements that often cannot be implemented optimally on general purpose processors and have the potential to experience substantial speedup on dedicated hardware. While general parallelism has been exploited at various levels for decades, the advent of heterogeneous cluster computing has allowed applications to be accelerated through the use of intelligently mapped computational tasks to well-suited hardware. This trend has continued with the use of dedicated ASIC and FPGA coprocessors to off-load particularly intensive computations. With the inclusion of embedded microprocessors into otherwise reconfigurable FPGA fabric, it has become feasible to construct a heterogeneous cluster composed of application specific hardware resources that can be programmatically treated as fully functional and independent cluster nodes via a standard message passing interface. The contribution of this thesis is the development of such a framework for organizing heterogeneous collections of reconfigurable FPGA computing elements into clusters, enabling the development of complex systems that deliver on the promise of parallel reconfigurable hardware. The framework includes a fully featured message passing interface implementation for seamless communication and synchronization among nodes running in an embedded Linux operating system environment, while managing hardware accelerators through device driver abstractions and standard APIs. A set of application case studies deployed on a test platform of Xilinx Virtex-4 and Virtex-5 FPGAs demonstrates functionality, elucidates performance characteristics, and promotes future research and development efforts.
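    As a rough illustration of the programming model the thesis targets, the sketch below uses the standard MPI C API to treat accelerator-backed nodes as ordinary cluster ranks. It is a generic example under that assumption, not the thesis's own framework, and the accelerate() helper is a hypothetical stand-in for an FPGA offload behind a device-driver abstraction.

```cpp
// Generic MPI sketch: FPGA-backed nodes addressed as ordinary MPI ranks.
#include <mpi.h>
#include <cstdio>
#include <vector>

// Hypothetical placeholder for an accelerator offload behind a device driver.
static void accelerate(std::vector<float>& data) {
    for (float& x : data) x *= 2.0f;   // stand-in for the FPGA computation
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<float> chunk(256, static_cast<float>(rank));
    accelerate(chunk);                  // each node runs its local accelerator

    // Gather per-node results on rank 0, exactly as in a conventional cluster.
    std::vector<float> all;
    if (rank == 0) all.resize(chunk.size() * size);
    MPI_Gather(chunk.data(), static_cast<int>(chunk.size()), MPI_FLOAT,
               all.data(), static_cast<int>(chunk.size()), MPI_FLOAT,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered %zu values from %d nodes\n", all.size(), size);
    MPI_Finalize();
    return 0;
}
```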

    Limits on Fundamental Limits to Computation

    An indispensable part of our lives, computing has also become essential to industries and governments. Steady improvements in computer hardware have been supported by periodic doubling of transistor densities in integrated circuits over the last fifty years. Such Moore scaling now requires increasingly heroic efforts, stimulating research in alternative hardware and stirring controversy. To help evaluate emerging technologies and enrich our understanding of integrated-circuit scaling, we review fundamental limits to computation: in manufacturing, energy, physical space, design and verification effort, and algorithms. To outline what is achievable in principle and in practice, we recall how some limits were circumvented and compare loose and tight limits. We also point out that engineering difficulties encountered by emerging technologies may indicate yet-unknown limits.
    Comment: 15 pages, 4 figures, 1 table
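    As a worked illustration of one of the energy limits such a review covers, Landauer's bound on the minimum energy needed to erase a single bit can be evaluated at room temperature (a standard textbook figure, included here only as an example):

```latex
% Landauer's bound: minimum energy dissipated to erase one bit at temperature T.
% At room temperature, T = 300 K, with k_B = 1.380649 \times 10^{-23}\,\mathrm{J/K}:
\[
  E_{\min} = k_B T \ln 2
           \approx (1.38 \times 10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})(0.693)
           \approx 2.9 \times 10^{-21}\,\mathrm{J}.
\]
```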

    Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC

    As the Large Hadron Collider (LHC) continues its upward progression in energy and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the challenges the experiments face in processing increasingly complex events will also continue to grow. Improvements in computing technologies and algorithms will be a key part of the advances necessary to meet this challenge. Parallel computing techniques, especially those using massively parallel computing (MPC), promise to be a significant part of this effort. In these proceedings, we discuss these techniques in the specific context of a particularly important problem: the reconstruction of charged particle tracks in an experiment's trigger algorithms, where high computing performance is critical for executing the track reconstruction in the available time. We discuss some areas where parallel computing has already shown benefits to the LHC experiments, and also demonstrate how an MPC-based trigger at the CMS experiment could not only improve performance, but also extend the reach of the CMS trigger system to capture events which are currently not practical to reconstruct at the trigger level.
    Comment: 14 pages, 6 figures. Proceedings of the 2nd International Summer School on Intelligent Signal Processing for Frontier Research and Industry (INFIERI2014), to appear in JINST. Revised version in response to referee comments.

    Towards the Specification of the GPU using Performance Parameters

    The characteristics of graphics processing units (GPUs), especially their parallel execution capabilities and fast memory access, render them attractive for many application areas. They promise more than an order of magnitude speedup over conventional processors for some non-graphics computations. The use of GPUs in general-purpose computing is becoming a widely accepted alternative, and the CUDA programming model is gaining acceptance as well. These trends make it necessary to have tools for evaluating GPUs. Performance parameters make it possible to model an architecture and predict the execution time of any application at any level of parallelism. Furthermore, they are a useful tool for comparing different architectures and determining their advantages and drawbacks. This work presents suitable benchmarks to evaluate different performance parameters of GPUs. The presented measurements focus on two aspects of GPU performance: computing power and global memory bandwidth. Estimating them allows us to determine the technical characteristics of a GPU and, in consequence, to analyze and optimize the performance of applications that run on current or future GPUs.
    Sociedad Argentina de Informática e Investigación Operativa
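    A minimal sketch of the kind of global-memory bandwidth microbenchmark the abstract describes is shown below. It is written against the standard OpenCL host API purely for illustration (the paper itself targets CUDA GPUs); it times repeated device-to-device copies and reports an estimated bandwidth, and the buffer size and repetition count are arbitrary choices.

```cpp
// Illustrative global-memory bandwidth estimate via timed device-to-device copies.
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    const size_t bytes = 256u << 20;             // 256 MiB per buffer (arbitrary)
    cl_mem src = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);

    const int reps = 20;
    clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes, 0, nullptr, nullptr);
    clFinish(queue);                             // warm-up, excluded from timing

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < reps; ++i)
        clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes, 0, nullptr, nullptr);
    clFinish(queue);
    auto t1 = std::chrono::high_resolution_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Each copy reads and writes `bytes`, so the traffic is counted twice.
    double gbps = 2.0 * bytes * reps / secs / 1e9;
    printf("estimated global memory bandwidth: %.1f GB/s\n", gbps);

    clReleaseMemObject(src); clReleaseMemObject(dst);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```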

    First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC

    Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA's Tesla Graphics Processing Unit (GPU) or Intel's Xeon Phi, in the High Level Trigger. These accelerators have the potential to provide faster or more energy efficient event selection, thus opening up possibilities for new complex triggers that were not previously feasible. At the same time, it is crucial to explore the performance limits achievable on the latest generation of multicore CPUs with the best software optimization methods. In this article, a new tracking algorithm based on the Hough transform will be evaluated for the first time on a multi-core Intel Xeon E5-2697v2 CPU, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor. Preliminary timing performance will be presented.
    Comment: 13 pages, 4 figures. Accepted to JINST.
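    To make the accumulator-voting pattern behind a Hough-transform tracker concrete, the sketch below implements a generic straight-line (rho, theta) Hough transform over a handful of synthetic hits. It is not the authors' algorithm (LHC tracking uses a track-specific parameterisation and far larger accumulators), but the per-hit voting loop is the part that maps naturally onto many-core hardware.

```cpp
// Generic rho/theta Hough-transform voting sketch over synthetic 2D hits.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const double kPi = 3.14159265358979323846;

    // A few synthetic hits lying roughly on the line y = x, plus one noise hit.
    std::vector<std::pair<double, double>> hits = {
        {1.0, 1.02}, {2.0, 1.98}, {3.0, 3.01}, {4.0, 3.99}, {1.5, 3.0}
    };

    const int nTheta = 180, nRho = 200;
    const double rhoMax = 10.0;
    std::vector<int> acc(nTheta * nRho, 0);      // accumulator (theta x rho)

    // Each hit votes for every (theta, rho) bin consistent with it.
    // This doubly nested loop is what a GPU or Xeon Phi would parallelise.
    for (const auto& h : hits) {
        for (int t = 0; t < nTheta; ++t) {
            double theta = t * kPi / nTheta;
            double rho = h.first * std::cos(theta) + h.second * std::sin(theta);
            int r = static_cast<int>((rho + rhoMax) / (2.0 * rhoMax) * nRho);
            if (r >= 0 && r < nRho) ++acc[t * nRho + r];
        }
    }

    // The most-voted cell corresponds to the best line (track) candidate.
    int best = 0;
    for (int i = 1; i < static_cast<int>(acc.size()); ++i)
        if (acc[i] > acc[best]) best = i;
    double bestThetaDeg = (best / nRho) * 180.0 / nTheta;
    printf("best candidate: theta ~ %.0f deg, %d votes\n", bestThetaDeg, acc[best]);
    return 0;
}
```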