A GPU-based hyperbolic SVD algorithm

Author: A.H. Sameh
F.T. Luk
G.S. Sachdev
H. Zha
I. Slapničar
J.R. Bunch
K. Veselić
R. Mathias
R.P. Brent
S. Lahabar
S. Singer
S. Zhang
Sanja Singer
V. Hari
Vedran Novaković
Z. Drmač
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm, using a massively parallel graphics processing unit (GPU), is developed. The algorithm also serves as the final stage of solving a symmetric indefinite eigenvalue problem. Numerical testing demonstrates the gains in speed and accuracy over sequential and MPI-parallelized variants of similar Jacobi-type HSVD algorithms. Finally, possibilities of hybrid CPU-GPU parallelism are discussed. Comment: Accepted for publication in BIT Numerical Mathematics

arXiv.org e-Print Archive

CiteSeerX

Crossref

FAMENA Repository

A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

Author: Novaković Vedran
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 27/09/2014

We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on the GPU's shared address space, but applicable also to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single-GPU setting needs a CPU for control purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card. Comment: Accepted for publication in SIAM Journal on Scientific Computing

arXiv.org e-Print Archive

CiteSeerX
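
The one-sided Jacobi SVD named in the abstract above can be illustrated with a minimal sequential sketch. This is not the authors' blocked GPU algorithm: it is the plain, unblocked textbook (Hestenes) variant with a row-cyclic pivot order, written in pure Python. The function name `one_sided_jacobi_svd` and its parameters `tol` and `max_sweeps` are assumptions chosen for illustration.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=50):
    """Plain one-sided Jacobi SVD of a list-of-rows matrix with m >= n.

    Returns (u, sigma, v) with a ~= u * diag(sigma) * v^T.
    Singular values are NOT sorted. Illustrative sketch only.
    """
    m, n = len(a), len(a[0])
    g = [row[:] for row in a]  # working copy; its columns converge to u * sigma
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):          # row-cyclic pivot order over column pairs
            for q in range(p + 1, n):
                alpha = sum(g[i][p] * g[i][p] for i in range(m))
                beta = sum(g[i][q] * g[i][q] for i in range(m))
                gamma = sum(g[i][p] * g[i][q] for i in range(m))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to columns p, q of g and of v.
                for mat in (g, v):
                    for row in mat:
                        x, y = row[p], row[q]
                        row[p] = c * x - s * y
                        row[q] = s * x + c * y
        if not rotated:
            break
    sigma = [math.sqrt(sum(g[i][j] ** 2 for i in range(m))) for j in range(n)]
    u = [[g[i][j] / sigma[j] if sigma[j] > 0 else 0.0 for j in range(n)]
         for i in range(m)]
    return u, sigma, v


if __name__ == "__main__":
    u, sigma, v = one_sided_jacobi_svd([[3.0, 0.0], [4.0, 5.0]])
    print(sigma)
```

For the 2x2 example, the singular values are sqrt(5) and sqrt(45). Each rotation touches only two columns, which is why disjoint column pairs can be processed in parallel; the pivot strategies mentioned in the abstract decide how such pairs are scheduled.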

Minimizing Communication for Eigenproblems and the Singular Value Decomposition

Author: Ballard Grey
Demmel James
Dumitriu Ioana
Publication date: 01/01/2010

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In [BDHS10] lower bounds were presented on the amount of communication required for essentially all O(n^3)-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs. Comment: 43 pages, 11 figures

arXiv.org e-Print Archive

CiteSeerX

Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

Author: Chávez Gustavo
Keyes David
Turkiyyah George
Zampini Stefano
Publication venue: 'Elsevier BV'
Publication date: 23/12/2017

We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner's applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner. Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics, Dec 201

arXiv.org e-Print Archive

eScholarship - University of California

Interactive boundary element analysis for engineering design.

Author: Coates G.
Foster T.M.
Mohamed M.S.
Trevelyan J.
Publication venue: University of Durham, School of Engineering
Publication date: 01/01/2013

Structural design of mechanical components is an iterative process that involves multiple stress analysis runs; this can be time-consuming and expensive. Significant improvements in the efficiency of this process can be made by increasing the level of interactivity. One approach is through real-time re-analysis of models with continuously updating geometry. Three primary areas need to be considered to accelerate the re-solution of boundary element problems. These are re-meshing the model, updating the boundary element system of equations and re-solution of the system. Once the initial model has been constructed and solved, the user may apply geometric perturbations to parts of the model. The re-meshing algorithm must accommodate these changes in geometry whilst retaining as much of the existing mesh as possible. This allows the majority of the previous boundary element system of equations to be re-used for the new analysis. For this problem, a GMRES solver has been shown to provide the fastest convergence rate. Further time savings can be made by preconditioning the updated system with the LU decomposition of the original system. Using these techniques, near real-time analysis can be achieved for 3D simulations; for 2D models such real-time performance has already been demonstrated.

Durham Research Online

Heriot Watt Pure
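
The idea in the abstract above of preconditioning an updated system with the LU decomposition of the original system can be sketched as follows. To keep the example short, a preconditioned Richardson (stationary) iteration is used here as a stand-in for the GMRES solver named in the abstract; it reuses the old factors in the same way but is a different, simpler method. The function names `lu_factor`, `lu_solve`, and `solve_updated`, and the small test matrices, are all hypothetical.

```python
def lu_factor(a):
    """LU factorization with partial pivoting; returns (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[piv] = lu[piv], lu[k]
        perm[k], perm[piv] = perm[piv], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                 # multiplier stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A y = b using the packed factors of P A = L U."""
    n = len(b)
    y = [b[p] for p in perm]
    for i in range(n):                           # forward substitution (unit L)
        y[i] -= sum(lu[i][j] * y[j] for j in range(i))
    for i in reversed(range(n)):                 # back substitution (U)
        y[i] = (y[i] - sum(lu[i][j] * y[j] for j in range(i + 1, n))) / lu[i][i]
    return y


def solve_updated(a_new, b, lu, perm, tol=1e-12, max_iter=100):
    """Richardson iteration x <- x + M^{-1}(b - a_new x), M = old LU factors.

    Converges only when a_new is a sufficiently small perturbation of the
    matrix that was factorized. Returns (x, iterations_used).
    """
    n = len(b)
    x = [0.0] * n
    for it in range(max_iter):
        r = [b[i] - sum(a_new[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(ri) for ri in r) <= tol:
            return x, it
        dx = lu_solve(lu, perm, r)               # apply the old factors as preconditioner
        x = [xi + di for xi, di in zip(x, dx)]
    return x, max_iter


if __name__ == "__main__":
    a_old = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    a_new = [[4.2, 1.0, 0.0], [1.0, 4.0, 1.1], [0.0, 1.0, 3.9]]  # perturbed system
    lu, perm = lu_factor(a_old)                  # factorize once, reuse afterwards
    x, its = solve_updated(a_new, [1.0, 2.0, 3.0], lu, perm)
    print(x, its)
```

The factorization cost is paid once for the original system; each later solve against a perturbed matrix needs only matrix-vector products and triangular solves. A Krylov method such as GMRES would use the same `lu_solve` call as its preconditioner application and is typically more robust when the perturbation is larger.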