1,425 research outputs found
Toward large-scale Hybrid Monte Carlo simulations of the Hubbard model on graphics processing units
The performance of the Hybrid Monte Carlo algorithm is determined by the
speed of sparse matrix-vector multiplication within the context of
preconditioned conjugate gradient iteration. We study these operations as
implemented for the fermion matrix of the Hubbard model in d+1 space-time
dimensions, and report a performance comparison between a 2.66 GHz Intel Xeon
E5430 CPU and an NVIDIA Tesla C1060 GPU using double-precision arithmetic. We
find speedup factors ranging between 30-350 for d = 1, and in excess of 40 for
d = 3. We argue that such speedups are of considerable impact for large-scale
simulational studies of quantum many-body systems.Comment: 8 pages, 5 figure
Implementing the conjugate gradient algorithm on multi-core systems
In linear solvers, like the conjugate gradient algorithm, sparse-matrix vector multiplication is an important kernel. Due to the sparseness of the matrices, the solver runs relatively slow. For digital optical tomography (DOT), a large set of linear equations have to be solved which currently takes in the order of hours on desktop computers. Our goal was to speed up the conjugate gradient solver. In this paper we present the results of applying multiple optimization techniques and exploiting multi-core solutions offered by two recently introduced architectures: Intel’s Woodcrest\ud
general purpose processor and NVIDIA’s G80 graphical processing unit. Using these techniques for these architectures, a speedup of a factor three\ud
has been achieved
Parallel Sparse Matrix Solver on the GPU Applied to Simulation of Electrical Machines
Nowadays, several industrial applications are being ported to parallel
architectures. In fact, these platforms allow acquire more performance for
system modelling and simulation. In the electric machines area, there are many
problems which need speed-up on their solution. This paper examines the
parallelism of sparse matrix solver on the graphics processors. More
specifically, we implement the conjugate gradient technique with input matrix
stored in CSR, and Symmetric CSR and CSC formats. This method is one of the
most efficient iterative methods available for solving the finite-element basis
functions of Maxwell's equations. The GPU (Graphics Processing Unit), which is
used for its implementation, provides mechanisms to parallel the algorithm.
Thus, it increases significantly the computation speed in relation to serial
code on CPU based systems
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.Comment: 18 pages, 7 figure
Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8–4.3 speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9 and 48.2 speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively
A Modeling Approach based on UML/MARTE for GPU Architecture
Nowadays, the High Performance Computing is part of the context of embedded
systems. Graphics Processing Units (GPUs) are more and more used in
acceleration of the most part of algorithms and applications. Over the past
years, not many efforts have been done to describe abstractions of applications
in relation to their target architectures. Thus, when developers need to
associate applications and GPUs, for example, they find difficulty and prefer
using API for these architectures. This paper presents a metamodel extension
for MARTE profile and a model for GPU architectures. The main goal is to
specify the task and data allocation in the memory hierarchy of these
architectures. The results show that this approach will help to generate code
for GPUs based on model transformations using Model Driven Engineering (MDE).Comment: Symposium en Architectures nouvelles de machines (SympA'14) (2011
Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures
The implementation of a full electronic structure calculation code on a
hybrid parallel architecture with Graphic Processing Units (GPU) is presented.
The code which is on the basis of our implementation is a GNU-GPL code based on
Daubechies wavelets. It shows very good performances, systematic convergence
properties and an excellent efficiency on parallel computers. Our GPU-based
acceleration fully preserves all these properties. In particular, the code is
able to run on many cores which may or may not have a GPU associated. It is
thus able to run on parallel and massive parallel hybrid environment, also with
a non-homogeneous ratio CPU/GPU. With double precision calculations, we may
achieve considerable speedup, between a factor of 20 for some operations and a
factor of 6 for the whole DFT code.Comment: 14 pages, 8 figure
- …