Solving the Ghost-Gluon System of Yang-Mills Theory on GPUs
We solve the ghost-gluon system of Yang-Mills theory using Graphics
Processing Units (GPUs). Working in Landau gauge, we use the Dyson-Schwinger
formalism for the mathematical description, as this approach is well suited to
benefit directly from the computing power of GPUs. With the help of a
Chebyshev expansion for the dressing functions and a subsequent application of
a Newton-Raphson method, the non-linear system of coupled integral equations is
linearized. The resulting Newton matrix is generated in parallel using OpenMPI
and CUDA(TM). Our results show that it is possible to cut the run time by
two orders of magnitude compared to a sequential version of the code. This
makes the proposed techniques well suited for Dyson-Schwinger calculations on
more complicated systems, where the Yang-Mills sector of QCD serves as a
starting point. In addition, the computation of Schwinger functions using GPU
devices is studied.
Comment: 19 pages, 7 figures; additional figure added, dependence on
block size is investigated in more detail; version accepted by CP
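The linearization scheme the abstract describes (expand the unknown functions, then apply Newton-Raphson to the resulting coupled nonlinear algebraic system) can be sketched on a toy one-dimensional integral equation. This is an illustrative stand-in, not the Yang-Mills system: Gauss-Legendre collocation replaces the paper's Chebyshev expansion for brevity, and the kernel and constants are invented so the exact solution is known.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Toy nonlinear integral equation (stand-in for the ghost-gluon system):
#   f(x) = 1 + x + lam * x * \int_{-1}^{1} y f(y)^2 dy
# Discretize by collocation at Gauss-Legendre nodes, then linearize the
# coupled nonlinear system with Newton-Raphson, as in the paper's scheme.
lam = 0.3
n = 16
y, w = leggauss(n)                     # quadrature nodes and weights

f = np.ones(n)                         # initial guess at the collocation points
for _ in range(20):
    integral = w @ (y * f**2)
    resid = f - 1.0 - y - lam * y * integral
    # Newton matrix J_ij = dR_i/df_j = delta_ij - lam * y_i * w_j * y_j * 2 f_j;
    # in the paper it is this dense matrix that is generated on the GPUs.
    J = np.eye(n) - lam * np.outer(y, 2.0 * w * y * f)
    step = np.linalg.solve(J, resid)
    f -= step
    if np.linalg.norm(step) < 1e-13:
        break

a = 1.0 / (1.0 - 4.0 * lam / 3.0)      # exact solution of the toy: f(x) = 1 + a*x
```

Each Newton step solves a linear system whose matrix entries are independent of one another, which is what makes generating the Newton matrix an embarrassingly parallel task.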
Solving Lattice QCD systems of equations using mixed precision solvers on GPUs
Modern graphics hardware is designed for highly parallel numerical tasks and
promises significant cost and performance benefits for many scientific
applications. One such application is lattice quantum chromodynamics (lattice
QCD), where the main computational challenge is to efficiently solve the
discretized Dirac equation in the presence of an SU(3) gauge field. Using
NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector
product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double,
single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have
developed a new mixed precision approach for Krylov solvers using reliable
updates which allows for full double precision accuracy while using only single
or half precision arithmetic for the bulk of the computation. The resulting
BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations
until convergence, perform better than the usual defect-correction approach for
mixed precision.
Comment: 30 pages, 7 figures
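The reliable-update idea (run the Krylov recurrence mostly in low precision, but periodically recompute the true residual in double precision) can be sketched in NumPy. This is a minimal CPU analogue of the concept with illustrative parameters, not the paper's CUDA BiCGstab/CG:

```python
import numpy as np

def cg_reliable(A, b, tol=1e-10, update_every=20, max_iter=2000):
    """Conjugate gradients with float32 arithmetic for the bulk of the work.
    Every `update_every` iterations a 'reliable update' recomputes the true
    residual b - A x in float64, so the solver can reach far better accuracy
    than the low-precision inner recurrence alone would allow."""
    A32 = A.astype(np.float32)
    x = np.zeros(len(b))                       # iterate accumulated in float64
    r = (b - A @ x).astype(np.float32)         # working residual in float32
    p = r.copy()
    rs = r @ r
    for it in range(1, max_iter + 1):
        Ap = A32 @ p                           # cheap float32 mat-vec product
        alpha = rs / (p @ Ap)
        x += np.float64(alpha) * p
        r = r - alpha * Ap
        if it % update_every == 0:             # reliable update in float64
            r = (b - A @ x).astype(np.float32)
        rs_new = r @ r
        if np.sqrt(float(rs_new)) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Because the perturbations introduced by the float32 arithmetic scale with the current residual, the periodic double-precision corrections let the true residual keep contracting well below the single-precision limit.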
Accelerated Event-by-Event Neutrino Oscillation Reweighting with Matter Effects on a GPU
Oscillation probability calculations are becoming increasingly CPU intensive
in modern neutrino oscillation analyses. The independence of reweighting
individual events in a Monte Carlo sample lends itself to parallel
implementation on a Graphics Processing Unit. The library "Prob3++" was ported
to the GPU using the CUDA C API, allowing for large scale parallelized
calculations of neutrino oscillation probabilities through matter of constant
density, decreasing the execution time by a factor of 75 when compared to
performance on a single CPU.
Comment: final post-submission update; quantified the difference in event
rates for binned and event-by-event reweighting with a typical binning
scheme; improved formatting of references
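Since each Monte Carlo event's probability depends only on that event's own baseline and energy, events map one-to-one onto GPU threads. A vectorized NumPy version of the simplest, two-flavor vacuum case shows the per-event shape of the computation; the parameter values here are illustrative, and the paper's library handles the full three-flavor calculation with constant-density matter effects:

```python
import numpy as np

def osc_prob(L_km, E_GeV, sin2_2theta=0.085, dm2_eV2=2.5e-3):
    """Two-flavor vacuum appearance probability per event:
        P = sin^2(2 theta) * sin^2(1.267 * dm2 * L / E)
    with dm2 in eV^2, L in km and E in GeV (the 1.267 fixes the units).
    Each array element is one Monte Carlo event; on the GPU each element
    would be handled by its own thread."""
    phase = 1.267 * dm2_eV2 * np.asarray(L_km) / np.asarray(E_GeV)
    return sin2_2theta * np.sin(phase) ** 2
```

Reweighting N events is thus a single elementwise kernel with no inter-thread communication, which is why the GPU port pays off so directly.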
APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters
We describe herein the APElink+ board, a PCIe interconnect adapter featuring
the latest advances in wire speed and interface technology, plus hardware
support for an RDMA programming model and experimental acceleration of GPU
networking. This design allows us to build a low-latency, high-bandwidth PC
cluster, the APEnet+ network, the new generation of our cost-effective
cluster network architecture, scalable to tens of thousands of nodes. Some
test results and a characterization of data transmission on a complete
testbench, based on a commercial development card mounting an Altera FPGA,
are provided.
Comment: 6 pages, 7 figures; proceedings of CHEP 2010, Taiwan, October 18-2
Fine-grained bit-flip protection for relaxation methods
Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit flips, rejecting some component updates and turning the initially synchronized solver into an asynchronous iteration. Our experimental study, with sparse incomplete factorizations from a collection of real-world applications and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance.
This material is based upon work supported in part by the U.S. Department of Energy (Award Number DE-SC-0010042) and NVIDIA. E. S. Quintana-Ortí was supported by project CICYT TIN2014-53495-R of MINECO and FEDER.
Anzt, H.; Dongarra, J.; Quintana-Ortí, E. S. (2019). Fine-grained bit-flip protection for relaxation methods. Journal of Computational Science, 36:1-11. https://doi.org/10.1016/j.jocs.2016.11.013
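The mechanism the abstract describes can be sketched as follows. This is a minimal illustration with made-up thresholds, not the paper's actual detection criterion or GPU kernel: each Jacobi sweep checks, per component, whether the update changed implausibly relative to the previous sweep, and rejects suspicious updates, which turns the synchronized sweep into an asynchronous iteration.

```python
import numpy as np

def jacobi_protected(A, b, sweeps=500, growth=100.0):
    """Jacobi iteration with component-wise bit-flip protection (sketch).
    An update whose change versus the last iterate exceeds `growth` times
    that component's previous change is treated as corrupted and rejected;
    the component keeps its old value and simply lags one or more sweeps
    behind, i.e. the method becomes an asynchronous iteration."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b, dtype=float)
    prev_delta = np.full(len(b), np.inf)   # never reject on the first sweep
    for _ in range(sweeps):
        x_new = (b - R @ x) / D
        delta = np.abs(x_new - x)
        ok = delta <= growth * prev_delta  # cheap component-wise sanity check
        x = np.where(ok, x_new, x)         # rejected components stay put
        prev_delta = np.where(ok, delta, prev_delta)
    return x
```

Because a falsely rejected component is simply recomputed from fresher neighbors on the next sweep, the check costs convergence delay rather than correctness, matching the trade-off the paper studies.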
Simulation of reaction-diffusion processes in three dimensions using CUDA
Numerical solution of reaction-diffusion equations in three dimensions is one
of the most challenging applied mathematical problems. Since these simulations
are very time consuming, any ideas and strategies aiming at the reduction of
CPU time are important topics of research. A general and robust idea is the
parallelization of source codes/programs. Recently, the technological
development of graphics hardware created a possibility to use desktop video
cards to solve numerically intensive problems. We present a powerful parallel
computing framework to solve reaction-diffusion equations numerically using the
Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion
problems, (i) diffusion of chemically inert compound, (ii) Turing pattern
formation, (iii) phase separation in the wake of a moving diffusion front and
(iv) air pollution dispersion were solved, and additionally both the Shared
method and the Moving Tiles method were tested. Our results show that the
parallel implementation achieves typical acceleration factors of 5-40
compared to a single-threaded CPU implementation on a 2.8 GHz desktop
computer.
Comment: 8 figures, 5 tables
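The computational core such a framework parallelizes is the stencil update. A minimal NumPy version of one explicit time step for plain diffusion on a periodic 3-D grid is sketched below; it is an illustrative analogue, with the paper's Shared and Moving Tiles methods being CUDA tiling strategies for exactly this kind of kernel:

```python
import numpy as np

def diffusion_step(c, D=1.0, dx=1.0, dt=0.1):
    """One explicit Euler step of dc/dt = D * laplace(c) on a periodic grid,
    using the seven-point stencil; stable for D*dt/dx**2 <= 1/6. On the GPU
    each grid point becomes one thread; np.roll stands in for the neighbor
    loads that a shared-memory tiling scheme would stage explicitly."""
    lap = (np.roll(c, 1, axis=0) + np.roll(c, -1, axis=0)
         + np.roll(c, 1, axis=1) + np.roll(c, -1, axis=1)
         + np.roll(c, 1, axis=2) + np.roll(c, -1, axis=2) - 6.0 * c) / dx**2
    return c + D * dt * lap
```

A reaction-diffusion solver adds the (pointwise, and therefore equally parallel) reaction terms to the same update; the neighbor loads are what make memory tiling the performance-critical choice on the GPU.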
Skyline: Interactive In-Editor Computational Performance Profiling for Deep Neural Network Training
Training a state-of-the-art deep neural network (DNN) is a
computationally-expensive and time-consuming process, which incentivizes deep
learning developers to debug their DNNs for computational performance. However,
effectively performing this debugging requires intimate knowledge about the
underlying software and hardware systems---something that the typical deep
learning developer may not have. To help bridge this gap, we present Skyline: a
new interactive tool for DNN training that supports in-editor computational
performance profiling, visualization, and debugging. Skyline's key contribution
is that it leverages special computational properties of DNN training to
provide (i) interactive performance predictions and visualizations, and (ii)
directly manipulatable visualizations that, when dragged, mutate the batch size
in the code. As an in-editor tool, Skyline allows users to leverage these
diagnostic features to debug the performance of their DNNs during development.
An exploratory qualitative user study of Skyline produced promising results;
all the participants found Skyline to be useful and easy to use.
Comment: 14 pages, 5 figures; appears in the proceedings of UIST'2