    Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

    The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm that distributes the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12-core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general-purpose GPU processors, where our implementation obtains up to 90 GFlops on an NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).
    Tomás and Bai were supported in part by the U.S. DOE SciDAC grant DOE-DE-FC0206ER25793 and NSF grant PHY1005502. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231.
    Tomás Domínguez, AE.; Bai, Z.; Hernández García, V. (2013). Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors. In High Performance Computing for Computational Science - VECPAR 2012. Springer Verlag (Germany): Series. 50-58. https://doi.org/10.1007/978-3-642-38718-0_8
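To make the terms in the abstract above concrete, the following is a minimal serial sketch of Householder QR with column pivoting in NumPy, together with a round-robin (column cyclic) ownership map. It is not the authors' implementation: the names `cyclic_owner` and `qr_column_pivoting` are hypothetical, the code is unblocked, and it recomputes the trailing column norms at every step instead of downdating them as DGEQP3 does.

```python
import numpy as np

def cyclic_owner(n_cols, n_workers):
    """Round-robin map (illustrative): column j belongs to worker j mod n_workers."""
    return [j % n_workers for j in range(n_cols)]

def qr_column_pivoting(A):
    """Unblocked Householder QR with column pivoting, so that A[:, perm] = Q @ R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    perm = np.arange(n)
    norms2 = np.sum(R * R, axis=0)  # squared column norms
    for k in range(min(m, n)):
        # Pivot: move the trailing column with the largest norm to position k.
        p = k + int(np.argmax(norms2[k:]))
        if p != k:
            R[:, [k, p]] = R[:, [p, k]]
            perm[[k, p]] = perm[[p, k]]
            norms2[[k, p]] = norms2[[p, k]]
        # Build the Householder vector that zeroes R[k+1:, k].
        x = R[k:, k]
        alpha = np.linalg.norm(x)
        if alpha == 0.0:
            break  # all remaining columns are zero
        v = x.copy()
        v[0] += np.copysign(alpha, x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing block and accumulate Q.
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
        # Simplification: recompute the trailing norms instead of downdating.
        norms2[k + 1:] = np.sum(R[k + 1:, k + 1:] ** 2, axis=0)
    return Q, np.triu(R), perm
```

With a cyclic map, the trailing columns that remain after step k stay spread almost evenly over the workers, which is the load-balancing property a round-robin distribution is usually chosen for.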

    Novel Modifications of Parallel Jacobi Algorithms

    We describe two main classes of one-sided trigonometric and hyperbolic Jacobi-type algorithms for computing eigenvalues and eigenvectors of Hermitian matrices. These types of algorithms exhibit significant advantages over many other eigenvalue algorithms. If the matrices permit, both types of algorithms compute the eigenvalues and eigenvectors with high relative accuracy. We present novel parallelization techniques for both the trigonometric and the hyperbolic classes of algorithms, as well as some new ideas on how pivoting in each cycle of the algorithm can improve the speed of the parallel one-sided algorithms. These parallelization approaches are applicable to both distributed-memory and shared-memory machines. The numerical testing performed indicates that the hyperbolic algorithms may be superior to the trigonometric ones, although, in theory, the latter seem more natural.
    Comment: Accepted for publication in Numerical Algorithms
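As background for the parallel pivoting mentioned above, here is a small sketch of the classic round-robin (tournament) ordering, which groups the column pairs of one Jacobi cycle into steps of disjoint pairs that can be rotated concurrently. This is the textbook ordering, not the novel strategies of the paper, and the function name `round_robin_sweep` is hypothetical.

```python
def round_robin_sweep(n):
    """Return n-1 steps, each a list of n/2 disjoint index pairs (n must be even).

    Over the n-1 steps every pair (i, j) with i < j appears exactly once.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    players = list(range(n))
    steps = []
    for _ in range(n - 1):
        # Pair the i-th entry with its mirror image in the current arrangement.
        pairs = [tuple(sorted((players[i], players[n - 1 - i])))
                 for i in range(n // 2)]
        steps.append(pairs)
        # Keep the first entry fixed and rotate the remaining ones by one place.
        players = [players[0]] + [players[-1]] + players[1:-1]
    return steps
```

For n = 4 this yields the steps [(0, 3), (1, 2)], [(0, 2), (1, 3)], [(0, 1), (2, 3)]; the pairs inside one step touch different columns and are therefore independent.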

    On Temple--Kato like inequalities and applications

    We give both lower and upper estimates for eigenvalues of unbounded positive definite operators in an arbitrary Hilbert space. We show scaling-robust relative eigenvalue estimates for these operators, in analogy to such estimates of current interest in numerical linear algebra. Only simple matrix-theoretic tools, such as Schur complements, are used. As prototypes for the strength of our method, we discuss a singularly perturbed Schrödinger operator and study convergence estimates for finite element approximations. The estimates can be viewed as a natural quadratic form version of the celebrated Temple--Kato inequality.
    Comment: submitted to SIAM Journal on Numerical Analysis (a major revision of the paper
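For orientation, the classical Temple--Kato inequality referred to above can be stated as follows. This is the standard operator form as it is usually quoted, not the quadratic form version derived in the paper, and the symbols $\psi$, $\rho$, $\varepsilon$, $\alpha$, $\beta$ are notation chosen here.

```latex
% A self-adjoint, \psi a unit vector in the domain of A.
\rho = \langle \psi, A\psi \rangle, \qquad
\varepsilon = \| A\psi - \rho\,\psi \| .
% If the interval (\alpha, \beta) contains \rho and exactly one point
% \lambda of the spectrum of A, and
% \varepsilon^2 < (\beta - \rho)(\rho - \alpha), then
\rho - \frac{\varepsilon^{2}}{\beta - \rho}
\;\le\; \lambda \;\le\;
\rho + \frac{\varepsilon^{2}}{\rho - \alpha} .
```

The bounds are quadratic in the residual norm $\varepsilon$, which is why they improve on the elementary estimate $|\lambda - \rho| \le \varepsilon$ once a spectral gap is known.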

    Convergence of the Eberlein diagonalization method under the generalized serial pivot strategies

    The Eberlein method is a Jacobi-type process for solving the eigenvalue problem of an arbitrary matrix. In each iteration, two transformations are applied to the underlying matrix: a plane rotation and a non-unitary elementary transformation. The paper studies the method under the broad class of generalized serial pivot strategies. We prove the global convergence of the Eberlein method under the generalized serial pivot strategies with permutations and present several numerical examples.
    Comment: 16 pages, 3 figures

    Accurate solution of structured least squares problems via rank-revealing decompositions

    Least squares problems $\min_x \|b - Ax\|_2$ in which the matrix $A \in \mathbb{C}^{m \times n}$ ($m \ge n$) has some particular structure arise frequently in applications. Polynomial data fitting is a well-known instance of problems that yield highly structured matrices, but many other examples exist. Very often, structured matrices have huge condition numbers $\kappa_2(A) = \|A\|_2 \|A^\dagger\|_2$ ($A^\dagger$ is the Moore-Penrose pseudoinverse of $A$), and therefore standard algorithms fail to compute accurate minimum 2-norm solutions of least squares problems. In this work, we introduce a framework that allows us to compute minimum 2-norm solutions of many classes of structured least squares problems accurately, i.e., with errors $\|\hat{x}_0 - x_0\|_2 / \|x_0\|_2 = O(u)$, where $u$ is the unit roundoff, independently of the magnitude of $\kappa_2(A)$ for most vectors $b$. The cost of these accurate computations is $O(n^2 m)$ flops, i.e., roughly the same cost as standard algorithms for least squares problems. The approach in this work relies on first computing an accurate rank-revealing decomposition of $A$, an idea that has been widely used in recent decades to compute, for structured ill-conditioned matrices, singular value decompositions, eigenvalues and eigenvectors in the Hermitian case, and solutions of linear systems with high relative accuracy. In order to prove that accurate solutions are computed, a new multiplicative perturbation theory of the least squares problem is needed. The results presented in this paper are valid for both full rank and rank deficient problems and also in the case of underdetermined linear systems ($m < n$). Among other types of matrices, the new method applies to rectangular Cauchy, Vandermonde, and graded matrices, and detailed numerical tests for Cauchy matrices are presented.
    This work was supported by the Ministerio de Economía y Competitividad of Spain through grants MTM-2009-09281, MTM-2012-32542 (Ceballos, Dopico, and Molera) and MTM2010-18057 (Castro-González).
    Published.
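The idea of solving a least squares problem through a rank-revealing decomposition can be sketched as follows, assuming the factors of $A = X D Y^T$ are already available, with $X$ of full column rank, $D$ diagonal and nonsingular, and $Y$ square and nonsingular. This only illustrates the three-stage solve in the full column rank case; the function name `lstsq_from_rrd` is hypothetical, and the sketch neither computes the decomposition accurately nor reproduces the paper's error analysis.

```python
import numpy as np

def lstsq_from_rrd(X, d, Y, b):
    """Solve min ||b - A x||_2 for A = X @ diag(d) @ Y.T from its factors.

    Assumes X (m x n) has full column rank, d has no zero entries,
    and Y (n x n) is nonsingular, so that x = Y^{-T} D^{-1} X^+ b.
    """
    z, *_ = np.linalg.lstsq(X, b, rcond=None)  # least squares with X only
    w = z / d                                  # diagonal solve with D
    return np.linalg.solve(Y.T, w)             # square solve with Y^T
```

The point of this ordering is that the ill-conditioning of $A$ is confined to the diagonal factor, where division is performed entrywise, while the least squares solve and the linear solve involve only the well-conditioned factors $X$ and $Y$.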

    A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

    We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on the GPU's shared address space, which are also applicable to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single-GPU setting needs a CPU for control purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card.
    Comment: Accepted for publication in SIAM Journal on Scientific Computing
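The building block behind the abstract above is the one-sided Jacobi iteration, which orthogonalizes pairs of columns by plane rotations. Below is a minimal serial, unblocked NumPy sketch of that iteration with a row-cyclic pivot order; it assumes a real matrix of full column rank, the name `one_sided_jacobi_svd` is hypothetical, and none of the hierarchical blocking or GPU pivot strategies of the paper are reproduced.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=60):
    """One-sided Jacobi SVD sketch: returns U, sigma, V with A = U diag(sigma) V^T.

    Assumes A is real with full column rank (no zero singular values).
    """
    G = np.array(A, dtype=float)
    n = G.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = G[:, p] @ G[:, p]
                aqq = G[:, q] @ G[:, q]
                apq = G[:, p] @ G[:, q]
                # Skip pairs of columns that are already numerically orthogonal.
                if abs(apq) <= tol * np.sqrt(app * aqq):
                    continue
                rotated = True
                # Rotation angle that makes columns p and q orthogonal.
                zeta = (aqq - app) / (2.0 * apq)
                t = np.copysign(1.0, zeta) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                # Apply the same rotation to the columns of G and of V.
                for M in (G, V):
                    colp = M[:, p].copy()
                    M[:, p] = c * colp - s * M[:, q]
                    M[:, q] = s * colp + c * M[:, q]
        if not rotated:
            break
    # Column norms of the orthogonalized matrix are the singular values.
    sigma = np.linalg.norm(G, axis=0)
    order = np.argsort(sigma)[::-1]
    sigma = sigma[order]
    U = G[:, order] / sigma
    return U, sigma, V[:, order]
```

Because each rotation touches only two columns, rotations on disjoint column pairs are independent, which is what parallel pivot strategies exploit.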