52 research outputs found

GPU Concurrency: Weak Behaviours and Programming Assumptions

Author: AMD.
AMD.
AMD.
Cederman D.
Cederman D.
Collier W.
Core ARM.
Feng W.
Hower D. R.
Hwu W.-m. W.
Khronos OpenCL Working Group
Sanders J.
Sorensen T.
Southern AMD.
Stuart J. A.
Weaver D. L.
Xiao S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/11/2014
Field of study

Concurrency is pervasive and perplexing, particularly on graphics processing units (GPUs). Current specifications of languages and hardware are inconclusive; thus programmers often rely on folklore assumptions when writing software. To remedy this state of affairs, we conducted a large empirical study of the concurrent behaviour of deployed GPUs. Armed with litmus tests (i.e. short concurrent programs), we questioned the assumptions in programming guides and vendor documentation about the guarantees provided by hardware. We developed a tool to generate thousands of litmus tests and run them under stressful workloads. We observed a litany of previously elusive weak behaviours, and exposed folklore beliefs about GPU programming, often supported by official tutorials, as false. As a way forward, we propose a model of Nvidia GPU hardware, which correctly models every behaviour witnessed in our experiments. The model is a variant of SPARC Relaxed Memory Order (RMO), structured following the GPU concurrency hierarchy.
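
To make the notion of a litmus test concrete, here is a minimal sketch in Python. It is not taken from the paper or its tool: it models the classic message-passing shape (two stores in one thread, two loads in the other) and enumerates every sequentially consistent interleaving. The names `T0`, `T1`, `interleavings` and `run` are illustrative assumptions. Any outcome outside the computed set would count as a weak behaviour if it were observed on real hardware.

```python
from itertools import combinations

# Message-passing (MP) litmus test, written as two tiny threads.
# Thread 0 writes data (x) and then a flag (y); thread 1 reads flag, then data.
# Illustrative model only; this is not the paper's test generator.
T0 = [("store", "x", 1), ("store", "y", 1)]
T1 = [("load", "y", "r1"), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    n = len(a) + len(b)
    for pos in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in pos else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory (sequential consistency)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

# All (r1, r2) outcomes allowed when memory operations are never reordered.
sc_outcomes = {run(s) for s in interleavings(T0, T1)}

# Seeing the flag set (r1 == 1) but stale data (r2 == 0) is not in that set.
weak_outcome = (1, 0)
print(sorted(sc_outcomes), weak_outcome in sc_outcomes)
```

The sketch only establishes which outcomes sequential consistency permits; whether a given GPU actually exhibits the remaining outcome is an empirical question of the kind the study addresses.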

Crossref

Oxford University Research Archive

Kent Academic Repository

Spiral - Imperial College Digital Repository

Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU

Author: A C Kaboth
D Payne
Khronos OpenCL Working Group
N. Whitehead
NVIDIA Corporation
OpenMP Architecture Review Board
P. Pomorski
R G Calland
R. Wendell
Publication venue: 'IOP Publishing'
Publication date: 29/11/2013

Oscillation probability calculations are becoming increasingly CPU intensive in modern neutrino oscillation analyses. The independence of reweighting individual events in a Monte Carlo sample lends itself to parallel implementation on a Graphics Processing Unit. The library "Prob3++" was ported to the GPU using the CUDA C API, allowing for large-scale parallelized calculations of neutrino oscillation probabilities through matter of constant density, decreasing the execution time by a factor of 75 compared with performance on a single CPU. Comment: Final Update: Post submission update. Updated version: quantified the difference in event rates for binned and event-by-event reweighting with a typical binning scheme. Improved formatting of reference

arXiv.org e-Print Archive

Crossref

Royal Holloway - Pure

A GPU-accelerated implicit meshless method for compressible flows

Author: Agarwal
Antoniou
Azab
Batina
Blazek
Cao
Cheng Cao
Chiu
Das
Elsen
Farber
Frink
Fuhry
Goulos
Hong-Quan Chen
Jameson
Jia-Le Zhang
Katz
Katz
KHRONOS
Klöckner
Li
Liu
Lohner
Ma
Ma
Mani
NVIDIA
NVIDIA
Oñate
PGI
Phillips
Remacle
Roque
Sato
Schmitt
Stone
Xia
Yoon
Yoon
Zhang
Zhi-Hua Ma
Zimmerman
Publication venue: 'Elsevier BV'
Publication date: 04/02/2018

This paper develops a recently proposed GPU-based two-dimensional explicit meshless method (Ma et al., 2014) by devising and implementing an efficient parallel LU-SGS implicit algorithm to further improve the computational efficiency. The capability of the original 2D meshless code is extended to deal with 3D complex compressible flow problems. To resolve the inherent data dependency of the standard LU-SGS method, which causes race conditions between threads that destabilize the numerical computation, a generic rainbow coloring method is presented and applied to organize the computational points into different groups by painting neighboring points with different colors. The original LU-SGS method is modified and parallelized accordingly to perform calculations in a color-by-color manner. The CUDA Fortran programming model is employed to develop the key kernel functions to apply boundary conditions, calculate time steps, evaluate residuals as well as advance and update the solution in the temporal space. A series of two- and three-dimensional test cases including compressible flows over single- and multi-element airfoils and an M6 wing are carried out to verify the developed code. The obtained solutions agree well with experimental data and other computational results reported in the literature. Detailed analysis of the performance of the developed code reveals that the developed CPU-based implicit meshless method is at least four to eight times faster than its explicit counterpart. The computational efficiency of the implicit method could be further improved by ten to fifteen times on the GPU.
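
The coloring idea in this abstract can be illustrated with a small sketch. It assumes a plain greedy coloring over a hypothetical neighbor table and is not the paper's CUDA Fortran implementation; `greedy_coloring`, `color_groups` and the five-point `neighbors` example are illustrative names and data. Points that share a color have no neighbor relation, so each color group could be updated in parallel, one group after another.

```python
def greedy_coloring(neighbors):
    """Give each point the smallest color not already used by a colored neighbor."""
    colors = {}
    for p in sorted(neighbors):
        used = {colors[q] for q in neighbors[p] if q in colors}
        c = 0
        while c in used:
            c += 1
        colors[p] = c
    return colors

def color_groups(colors):
    """Group points by color, ordered by color index, for a color-by-color sweep."""
    groups = {}
    for p, c in colors.items():
        groups.setdefault(c, []).append(p)
    return [groups[c] for c in sorted(groups)]

# Hypothetical point cloud: each point lists the points in its support stencil.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}

colors = greedy_coloring(neighbors)
groups = color_groups(colors)

# Each inner list is one batch whose points do not depend on each other.
for color, batch in enumerate(groups):
    print(color, batch)
```

Greedy coloring is only one way to build such groups; the abstract does not say which coloring procedure the authors use.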

Crossref

E-space: Manchester Metropolitan University's Research Repository

GRACE-2: integrating fine-grained application adaptation with global adaptation for saving energy

Author: Albert F. Harris
Anzinger
Aydin
Caccamo
Chu
Cisco
Corner
Daniel G. Sachs
de Lara
Douglas L. Jones
Efstratiou
Flautner
Flinn
Flinn
Gopalan
Hughes
Hughes
Intel
Ishihara
Khronos
Klara Nahrstedt
Krantz
Li
Mesarina
Moser
Noble
Pering
Pillai
Poellabauer
Quan
Robin H. Kravets
Rusu
Sachs
Sachs
Sarita V. Adve
Simunic
Vardhan
Vibhore Vardhan
Wanghong Yuan
Xiph.org
Yuan
Yuan
Yuan
Zeng
Publication venue: 'Inderscience Publishers'
Publication date: 01/01/2009

Crossref

The OpenCL Specification Version: 1.0 Document Revision: 29

Author: Aaftab Munshi
Khronos OpenCL

CiteSeerX