A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of the GPU
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies for the GPU's shared address space that are also applicable to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single-GPU
setting needs a CPU for control purposes only, while utilizing the GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card. Comment: Accepted for publication in SIAM Journal on Scientific Computing
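The basic, unblocked one-sided Jacobi iteration that such algorithms build on can be sketched as follows. This is a plain sequential sketch with a cyclic pivot order, not the authors' hierarchically blocked GPU algorithm or their parallel pivot strategies; the function name, tolerance, and sweep limit are illustrative assumptions.

```python
import math


def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=30):
    """Singular values of a real matrix (list of rows), in descending order.

    Hypothetical helper for illustration: plane rotations are applied to
    pairs of columns until all columns are mutually orthogonal; the column
    norms are then the singular values.
    """
    m, n = len(a), len(a[0])
    # Store the matrix by columns, since each rotation mixes two columns.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    x, y = cols[p][i], cols[q][i]
                    cols[p][i] = c * x - s * y
                    cols[q][i] = s * x + c * y
        if not rotated:
            break  # a full sweep without rotations means convergence
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)
```

For the 3-by-2 matrix with rows (1, 2), (3, 4), (5, 6), the function returns values close to 9.5255 and 0.5143.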
The LAPW method with eigendecomposition based on the Hari--Zimmermann generalized hyperbolic SVD
In this paper we propose an accurate, highly parallel algorithm for the
generalized eigendecomposition of a matrix pair (H, S), given in a factored
form (F^* J F, G^* G). Matrices H and S are generally complex
and Hermitian, and S is positive definite. Matrices of this type emerge from
the representation of the Hamiltonian of a quantum mechanical system in terms
of an overcomplete set of basis functions. This expansion is part of a class of
models within the broad field of Density Functional Theory, which is considered
the gold standard in condensed matter physics. The overall algorithm consists
of four phases, the second and the fourth being optional, where the last two
phases compute the generalized hyperbolic SVD of a complex matrix
pair (F, G), according to a given matrix J defining the hyperbolic scalar
product. If J = I, then these two phases compute the GSVD in parallel very
accurately and efficiently. Comment: The supplementary material is available at
https://web.math.pmf.unizg.hr/mfbda/papers/sm-SISC.pdf due to its size. This
revised manuscript is currently being considered for publication
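For the special case in which the hyperbolic scalar product reduces to the ordinary one, the link between the generalized eigenproblem and the GSVD can be sketched as below. The notation (H, S, F, G, U, V, Z, and the diagonal factors) and the particular form of the decomposition are assumptions made for illustration, not the paper's own formulation.

```latex
% Assumed: H = F^* F and S = G^* G (the J = I case), with S positive definite.
% Assumed GSVD form: U^* U = I, V^* V = I, Z nonsingular,
% \Sigma_F and \Sigma_G real diagonal, \Sigma_G nonsingular.
\begin{align*}
F &= U \Sigma_F Z^{-1}, \qquad G = V \Sigma_G Z^{-1}, \\
Z^* H Z &= \Sigma_F U^* U \Sigma_F = \Sigma_F^2, \qquad
Z^* S Z = \Sigma_G V^* V \Sigma_G = \Sigma_G^2, \\
H Z &= S Z \Lambda, \qquad \Lambda = \Sigma_F^2 \Sigma_G^{-2}.
\end{align*}
```

Under these assumptions the columns of Z are generalized eigenvectors and the generalized eigenvalues are the squared ratios of the diagonal entries, so the products F^* F and G^* G never need to be formed explicitly.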
A Self-learning Algebraic Multigrid Method for Extremal Singular Triplets and Eigenpairs
A self-learning algebraic multigrid method for dominant and minimal singular
triplets and eigenpairs is described. The method consists of two multilevel
phases. In the first, multiplicative phase (setup phase), tentative singular
triplets are calculated along with a multigrid hierarchy of interpolation
operators that approximately fit the tentative singular vectors in a collective
and self-learning manner, using multiplicative update formulas. In the second,
additive phase (solve phase), the tentative singular triplets are improved up
to the desired accuracy by using an additive correction scheme with fixed
interpolation operators, combined with a Ritz update. A suitable generalization
of the singular value decomposition is formulated that applies to the coarse
levels of the multilevel cycles. The proposed algorithm combines and extends
two existing multigrid approaches for symmetric positive definite eigenvalue
problems to the case of dominant and minimal singular triplets. Numerical tests
on model problems from different areas show that the algorithm converges to
high accuracy in a modest number of iterations, and is flexible enough to deal
with a variety of problems due to its self-learning properties. Comment: 29 pages
Parallel eigenanalysis of finite element models in a completely connected architecture
A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, KΦ = MΦΩ, where K and M are of order N and Ω is of order q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, the Cray X-MP. A finite element model is divided into m domains, each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effects of the number of domains, the number of degrees of freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented, and the performance of its subroutines in a parallel environment is analyzed.
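The domain-to-processor mapping and the two performance measures named above can be illustrated with a minimal sketch. The round-robin assignment, the function names, and the timings in the usage note are assumptions for illustration; the abstract does not say how domains are distributed or what timings were measured. Speed-up and efficiency use their standard definitions.

```python
def assign_domains(num_domains, num_processors):
    """Map each domain index to a processor index.

    Assumed policy (not from the abstract): round-robin, so that when there
    are more domains than physical processors, the extra domains become
    additional logical tasks on processors that already hold a domain.
    """
    return {d: d % num_processors for d in range(num_domains)}


def speedup_and_efficiency(serial_time, parallel_time, num_processors):
    """Standard definitions: speed-up = T_1 / T_p, efficiency = speed-up / p."""
    speedup = serial_time / parallel_time
    return speedup, speedup / num_processors
```

With hypothetical timings of 100 s on one processor and 40 s on four, `speedup_and_efficiency(100.0, 40.0, 4)` gives a speed-up of 2.5 and an efficiency of 0.625.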