Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Graphics Processing Units (GPUs) are having a transformational effect on
numerical lattice quantum chromodynamics (LQCD) calculations of importance in
nuclear and particle physics. The QUDA library provides a package of mixed
precision sparse matrix linear solvers for LQCD applications, supporting single
GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This
library, interfaced to the QDP++/Chroma framework for LQCD calculations, is
currently in production use on the "9g" cluster at the Jefferson Laboratory,
enabling unprecedented price/performance for a range of problems in LQCD.
Nevertheless, memory constraints on current GPU devices limit the problem sizes
that can be tackled. In this contribution we describe the parallelization of
the QUDA library onto multiple GPUs using MPI, including strategies for the
overlapping of communication and computation. We report on both weak and strong
scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in
excess of 4 Tflops.
Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing 2010 (submitted April 12, 2010).
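The mixed-precision solver strategy that QUDA uses can be illustrated with classic iterative refinement: the expensive inner solve runs in single precision while residuals are accumulated in double precision. The sketch below (NumPy, dense, direct inner solve) shows the idea only; it is not QUDA's actual implementation, which uses sparse Krylov solvers on the GPU.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_outer=50):
    """Iterative refinement: inner correction solves in single
    precision, residual accumulation in double precision. An
    illustrative sketch of the mixed-precision strategy, not QUDA."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                       # double-precision residual
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        # cheap inner solve in single precision
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x = x + dx.astype(np.float64)       # accumulate in double
    return x
```

For a well-conditioned system a handful of outer iterations recovers full double-precision accuracy while most of the arithmetic runs at the faster, lower-precision rate.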
Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors
Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182x for a Xilinx Virtex5 LX 330T, 1.3-33x for an IBM Cell, and 3-131x for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
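The performance-tuner described above sweeps implementation configurations such as unroll factor and vector length and keeps the fastest. A minimal sketch of that exhaustive-search loop, assuming a timing-based selection (the configuration names are illustrative, not the compiler's actual parameters):

```python
import itertools
import time

def autotune(kernel, configs, *args):
    """Time each candidate configuration and return the fastest.
    A sketch of the exhaustive performance-tuning loop, not the
    paper's tuner."""
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        t0 = time.perf_counter()
        kernel(cfg, *args)
        t = time.perf_counter() - t0
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Example search space: every (unroll factor, vector length) pair.
configs = [{"unroll": u, "vector": v}
           for u, v in itertools.product([1, 2, 4, 8], [4, 8, 16])]
```

In practice such tuners cache the winning configuration per architecture so the sweep is paid once, not on every run.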
High-Precision Numerical Simulations of Rotating Black Holes Accelerated by CUDA
Hardware accelerators (such as Nvidia's CUDA GPUs) have tremendous promise
for computational science, because they can deliver large gains in performance
at relatively low cost. In this work, we focus on the use of Nvidia's Tesla GPU
for high-precision (double, quadruple and octal precision) numerical
simulations in the area of black hole physics -- more specifically, solving a
partial-differential-equation using finite-differencing. We describe our
approach in detail and present the final performance results as compared with a
single-core desktop processor and also the Cell BE. We obtain mixed results --
order-of-magnitude gains in overall performance in some cases and negligible
gains in others.
Comment: 6 pages, 1 figure, 1 table, accepted for publication in the International Conference on High Performance Computing Systems (HPCS 2010).
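The core numerical operation here, solving a PDE by finite differencing, can be sketched with a second-order central-difference leapfrog step for the 1-D wave equation. This is an illustrative CPU sketch of the scheme, not the paper's Teukolsky-equation solver or its GPU kernel:

```python
import numpy as np

def wave_step(u_prev, u_curr, c, dx, dt):
    """One leapfrog step of u_tt = c^2 u_xx with second-order
    central finite differences and fixed (Dirichlet) boundaries."""
    lap = np.zeros_like(u_curr)
    lap[1:-1] = (u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]) / dx**2
    u_next = 2 * u_curr - u_prev + (c * dt) ** 2 * lap
    u_next[0] = u_next[-1] = 0.0
    return u_next
```

Each grid point updates independently from its neighbors' previous values, which is exactly the data-parallel structure that maps well onto a GPU, one thread per grid point.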
A high-accuracy optical linear algebra processor for finite element applications
Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuracy to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantifies the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
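Multiplication by digital convolution, the operation mentioned above, works by convolving the digit sequences of the two operands and then propagating carries. A short sketch in base 10 (the optical processor performs the convolution stage in hardware; this is only the arithmetic idea):

```python
def convolve_digits(a, b):
    """Multiply two base-10 digit lists (least-significant digit
    first): convolve the digit sequences, then resolve carries."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, da in enumerate(a):
        for j, db in enumerate(b):
            prod[i + j] += da * db        # raw convolution, no carries
    carry, out = 0, []
    for p in prod:                        # carry propagation pass
        carry, digit = divmod(p + carry, 10)
        out.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        out.append(digit)
    return out
```

Because the convolution stage involves only small digit products, its dynamic-range demands stay low, which is precisely why the encoding relaxes the optical system's accuracy requirements.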
An adaptive hierarchical domain decomposition method for parallel contact dynamics simulations of granular materials
A fully parallel version of the contact dynamics (CD) method is presented in
this paper. For large enough systems, 100% efficiency has been demonstrated for
up to 256 processors using a hierarchical domain decomposition with dynamic
load balancing. The iterative scheme to calculate the contact forces is left
domain-wise sequential, with data exchange after each iteration step, which
ensures its stability. The number of additional iterations required for
convergence by the partially parallel updates at the domain boundaries becomes
negligible with increasing number of particles, which allows for an effective
parallelization. Compared to the sequential implementation, we found no
influence of the parallelization on simulation results.
Comment: 19 pages, 15 figures, published in Journal of Computational Physics (2011).
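The "domain-wise sequential iteration with data exchange after each sweep" structure can be illustrated on a generic linear system: within a domain the update is sequential Gauss-Seidel, while values from other domains are frozen at the previous sweep, mimicking the boundary exchange. This is a loose structural sketch, not the contact-force iteration of the CD method:

```python
import numpy as np

def block_gauss_seidel(A, b, n_domains, iters=200):
    """Domain-decomposed iterative solve: Gauss-Seidel inside each
    domain, previous-sweep values across domain boundaries
    (illustrative sketch of the iteration structure only)."""
    n = len(b)
    bounds = np.linspace(0, n, n_domains + 1, dtype=int)
    x = np.zeros(n)
    for _ in range(iters):
        x_old = x.copy()                  # "exchanged" boundary data
        for d in range(n_domains):
            lo, hi = bounds[d], bounds[d + 1]
            for i in range(lo, hi):
                # fresh values inside the domain, stale ones outside
                s = A[i, lo:hi] @ x[lo:hi] - A[i, i] * x[i]
                s += A[i, :lo] @ x_old[:lo] + A[i, hi:] @ x_old[hi:]
                x[i] = (b[i] - s) / A[i, i]
    return x
```

The partially stale cross-domain values are what cost the extra iterations mentioned in the abstract; as domains grow, the boundary fraction shrinks and that overhead becomes negligible.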
HONEI: A collection of libraries for numerical computations targeting multiple processor architectures.
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3-4x and 4-16x faster execution on the Cell and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.
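The hardware-abstraction idea described above, one operation name resolving to backend-specific kernels, can be sketched with a small dispatch registry. The backend names and registry below are illustrative of the design pattern, not HONEI's actual C++ API:

```python
import numpy as np

# Registry mapping (operation, backend) pairs to kernel functions.
_backends = {}

def register(op, backend):
    """Decorator registering a kernel for one backend."""
    def deco(fn):
        _backends[(op, backend)] = fn
        return fn
    return deco

@register("scaled_sum", "cpu")
def scaled_sum_cpu(a, b, alpha):
    # plain scalar loop: the portable reference backend
    return [x + alpha * y for x, y in zip(a, b)]

@register("scaled_sum", "sse")   # stand-in for a vectorised backend
def scaled_sum_sse(a, b, alpha):
    return (np.asarray(a) + alpha * np.asarray(b)).tolist()

def dispatch(op, backend, *args):
    """Application code calls the operation by name; the backend
    choice selects the hardware-specific kernel."""
    return _backends[(op, backend)](*args)
```

Application code stays backend-agnostic, and adding an optimised application-specific kernel is just another registration, which mirrors the extensibility point the abstract highlights.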