Search CORE

1,425 research outputs found

Toward large-scale Hybrid Monte Carlo simulations of the Hubbard model on graphics processing units

Author: Anderson
Armour
Babich
Baker
Barros
Bertsch
Blankenbecler
Bloch
Brannick
Broyden
Bulgac
Bulgac
Bulgac
Bunk
Carlson
Carlson
Castro Neto
Chang
Clark
Clark
Dagotto
Datta
DeGrand
Demchik
Drut
Drut
Drut
Drut
Duane
Egri
Essler
Frezzotti
Frezzotti
Frezzotti
Frezzotti
Giamarchi
Giorgini
Gottlieb
Gupta
Hands
Harvey
Heiselberg
Hubbard
Hubbard
Joaquín E. Drut
Kinast
Kolb
Kyle A. Wendt
Lee
Lee
Lombardo
Luo
Luo
Magierski
Rombouts
Saad
Schwenk
Shuryak
Stewart
Stratonovich
Suzuki
Suzuki
Timo A. Lähde
Trotter
Publication venue: 'Elsevier BV'
Publication date: 20/07/2010
Field of study

The performance of the Hybrid Monte Carlo algorithm is determined by the speed of sparse matrix-vector multiplication within the context of preconditioned conjugate gradient iteration. We study these operations as implemented for the fermion matrix of the Hubbard model in d+1 space-time dimensions, and report a performance comparison between a 2.66 GHz Intel Xeon E5430 CPU and an NVIDIA Tesla C1060 GPU using double-precision arithmetic. We find speedup factors ranging between 30-350 for d = 1, and in excess of 40 for d = 3. We argue that such speedups are of considerable impact for large-scale simulational studies of quantum many-body systems.Comment: 8 pages, 5 figure

arXiv.org e-Print Archive

Crossref

Implementing the conjugate gradient algorithm on multi-core systems

Author: Bakker V.
Kokkeler A.B.J.
Smit G.J.M.
Wiggers W.A.
Publication venue: IEEE
Publication date: 01/01/2007
Field of study

In linear solvers, like the conjugate gradient algorithm, sparse-matrix vector multiplication is an important kernel. Due to the sparseness of the matrices, the solver runs relatively slow. For digital optical tomography (DOT), a large set of linear equations have to be solved which currently takes in the order of hours on desktop computers. Our goal was to speed up the conjugate gradient solver. In this paper we present the results of applying multiple optimization techniques and exploiting multi-core solutions offered by two recently introduced architectures: Intel’s Woodcrest\ud general purpose processor and NVIDIA’s G80 graphical processing unit. Using these techniques for these architectures, a speedup of a factor three\ud has been achieved

CiteSeerX

University of Twente Research Information

Parallel Sparse Matrix Solver on the GPU Applied to Simulation of Electrical Machines

Author: Dekeyser Jean-Luc
Guyomarch Frédéric
Menach Yvonnick Le
Rodrigues Antonio Wendell De Oliveira
Publication venue
Publication date: 22/11/2009
Field of study

Nowadays, several industrial applications are being ported to parallel architectures. In fact, these platforms allow acquire more performance for system modelling and simulation. In the electric machines area, there are many problems which need speed-up on their solution. This paper examines the parallelism of sparse matrix solver on the graphics processors. More specifically, we implement the conjugate gradient technique with input matrix stored in CSR, and Symmetric CSR and CSC formats. This method is one of the most efficient iterative methods available for solving the finite-element basis functions of Maxwell's equations. The GPU (Graphics Processing Unit), which is used for its implementation, provides mechanisms to parallel the algorithm. Thus, it increases significantly the computation speed in relation to serial code on CPU based systems

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs

Author: Guo Xu
Mueller Eike
Scheichl Robert
Shi Sinan
Publication venue
Publication date: 01/01/2013
Field of study

Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units have been shown to be highly efficient for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. We describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.Comment: 18 pages, 7 figure

arXiv.org e-Print Archive

CiteSeerX

OPUS

Crossref

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Author: A Dziekonski
AV Knyazev
AV Knyazev
C Yang
G Ortega
JW Choi
M Knap
M Shao
P Maris
P Maris
X Yang
Y Wang
Publication venue: eScholarship, University of California
Publication date: 01/01/2020
Field of study

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8

{\times }

–4.3

{\times }

speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9

{\times }

and 48.2

{\times }

speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively

Crossref

eScholarship - University of California

A Modeling Approach based on UML/MARTE for GPU Architecture

Author: Dekeyser Jean-Luc
Guyomarc'H Frédéric
Rodrigues Antonio Wendell De Oliveira
Publication venue
Publication date: 10/05/2011
Field of study

Nowadays, the High Performance Computing is part of the context of embedded systems. Graphics Processing Units (GPUs) are more and more used in acceleration of the most part of algorithms and applications. Over the past years, not many efforts have been done to describe abstractions of applications in relation to their target architectures. Thus, when developers need to associate applications and GPUs, for example, they find difficulty and prefer using API for these architectures. This paper presents a metamodel extension for MARTE profile and a model for GPU architectures. The main goal is to specify the task and data allocation in the memory hierarchy of these architectures. The results show that this approach will help to generate code for GPUs based on model transformations using Model Driven Engineering (MDE).Comment: Symposium en Architectures nouvelles de machines (SympA'14) (2011

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server

Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures

Author: Alexey Neelov
Goedecker S.
Jean-François Méhaut
Luigi Genovese
Matthieu Ospici
Stefan Goedecker
Thierry Deutsch
Publication venue
Publication date: 01/01/2009
Field of study

The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphic Processing Units (GPU) is presented. The code which is on the basis of our implementation is a GNU-GPL code based on Daubechies wavelets. It shows very good performances, systematic convergence properties and an excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all these properties. In particular, the code is able to run on many cores which may or may not have a GPU associated. It is thus able to run on parallel and massive parallel hybrid environment, also with a non-homogeneous ratio CPU/GPU. With double precision calculations, we may achieve considerable speedup, between a factor of 20 for some operations and a factor of 6 for the whole DFT code.Comment: 14 pages, 8 figure

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

edoc

HAL-CEA