
    A Householder-based algorithm for Hessenberg-triangular reduction

    The QZ algorithm for computing eigenvalues and eigenvectors of a matrix pencil $A - \lambda B$ requires that the matrices first be reduced to Hessenberg-triangular (HT) form. The current method of choice for HT reduction relies entirely on Givens rotations regrouped and accumulated into small dense matrices which are subsequently applied using matrix multiplication routines. A non-vanishing fraction of the total flop count must nevertheless still be performed as sequences of overlapping Givens rotations alternately applied from the left and from the right. The many data dependencies associated with this computational pattern lead to inefficient use of the processor and poor scalability. In this paper, we therefore introduce a fundamentally different approach that relies entirely on (large) Householder reflectors partially accumulated into block reflectors, by using (compact) WY representations. Even though the new algorithm requires more floating point operations than the state-of-the-art algorithm, extensive experiments on both real and synthetic data indicate that it is still competitive, even in a sequential setting. The new algorithm is conjectured to have better parallel scalability, an idea which is partially supported by early small-scale experiments using multi-threaded BLAS. The design and evaluation of a parallel formulation is future work.
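    A minimal sketch of the (compact) WY representation the abstract relies on, assuming reflectors of the form $H_j = I - \tau_j v_j v_j^T$; the code below is illustrative, not the paper's algorithm. It accumulates $k$ reflectors into $Q = I - VTV^T$ so that applying $Q$ becomes a few matrix multiplications instead of $k$ rank-1 updates.

```python
# Compact WY accumulation: fold k Householder reflectors into Q = I - V T V^T,
# then apply Q with BLAS-3 matrix multiplies.
import numpy as np

def compact_wy(V):
    """Given Householder vectors as the columns of V (m x k), build the upper
    triangular T with I - V @ T @ V.T == H_1 @ H_2 @ ... @ H_k, where
    H_j = I - tau_j v_j v_j^T and tau_j = 2 / (v_j^T v_j)."""
    m, k = V.shape
    T = np.zeros((k, k))
    for j in range(k):
        v = V[:, j]
        tau = 2.0 / (v @ v)
        # Forward columnwise recurrence (as in LAPACK's larft):
        # new column = -tau * T @ (V^T v), new diagonal entry = tau.
        T[:j, j] = -tau * (T[:j, :j] @ (V[:, :j].T @ v))
        T[j, j] = tau
    return T

rng = np.random.default_rng(0)
V = np.tril(rng.standard_normal((8, 3)))   # lower-trapezoidal reflector vectors
T = compact_wy(V)
Q = np.eye(8) - V @ T @ V.T
assert np.allclose(Q.T @ Q, np.eye(8))     # Q is orthogonal

A = rng.standard_normal((8, 5))
QA = A - V @ (T @ (V.T @ A))               # apply Q @ A with three GEMMs
assert np.allclose(QA, Q @ A)
```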

    Banded Householder representation of linear subspaces

    We show how to compactly represent any $n$-dimensional subspace of $\mathbb{R}^m$ as a banded product of Householder reflections using $n(m - n)$ floating point numbers. This is optimal since these subspaces form a Grassmannian manifold $\mathrm{Gr}_n(m)$ of dimension $n(m - n)$. The representation is stable and easy to compute: any matrix can be factored into the product of a banded Householder matrix and a square matrix using two to three QR decompositions.
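    As a plain, non-banded illustration of the underlying fact: the Householder vectors of a QR factorization of $M$ encode span($M$), and an orthonormal basis can be recovered from those vectors alone. The paper's banded construction is more compact, reaching the optimal $n(m-n)$ count; the function names below are made up for this example.

```python
# Encode a subspace by the Householder vectors of a QR factorization of M,
# then recover an orthonormal basis by applying the reflectors to the
# leading columns of the identity.
import numpy as np

def householder_vectors(M):
    """Householder QR of M (m x n); return the unit vectors v_1..v_n with
    H_j = I - 2 v_j v_j^T. Each v_j is zero above position j."""
    A = M.astype(float).copy()
    m, n = A.shape
    V = []
    for j in range(n):
        x = A[j:, j].copy()
        x[0] += np.copysign(np.linalg.norm(x), x[0])
        v = np.zeros(m)
        v[j:] = x / np.linalg.norm(x)
        A -= 2.0 * np.outer(v, v @ A)          # apply reflector to trailing A
        V.append(v)
    return V

def basis_from_vectors(V, m, n):
    """Recover an orthonormal basis of the encoded subspace."""
    Q = np.eye(m)[:, :n]
    for v in reversed(V):                      # Q = H_1 @ ... @ H_n @ I[:, :n]
        Q -= 2.0 * np.outer(v, v @ Q)
    return Q

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2))
Q = basis_from_vectors(householder_vectors(M), 6, 2)
U, _ = np.linalg.qr(M)
assert np.allclose(Q @ Q.T, U @ U.T)           # same subspace: projectors agree
```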

    Minimizing Communication in Linear Algebra

    In 1981, Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004, Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where $M$ is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
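    A worked instance of the stated bound, with assumed, illustrative sizes: conventional $n \times n$ matrix multiplication performs $2n^3$ arithmetic operations, giving a bandwidth lower bound of $\Omega(2n^3/\sqrt{M})$ words; since one message can carry at most $M$ words, dividing by $M$ yields the corresponding latency lower bound.

```python
# Evaluate the Omega(#ops / sqrt(M)) bandwidth bound and the derived
# message-count bound for conventional matrix multiplication.
from math import sqrt

n = 4096           # matrix dimension (assumed for illustration)
M = 512 * 1024     # fast-memory size in words (assumed)

flops = 2 * n**3
bandwidth_lb = flops / sqrt(M)    # words moved, up to a constant factor
latency_lb = bandwidth_lb / M     # messages, up to a constant factor
print(f"words >= ~{bandwidth_lb:.2e}, messages >= ~{latency_lb:.2e}")
```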

    A 3D Parallel Algorithm for QR Decomposition

    Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.

    Fast rectangular matrix multiplication and QR decomposition

    In the last twenty-five years there has been much research into “fast” matrix multiplication methods: ones that have an asymptotically smaller operation count than conventional multiplication. Most fast methods are derived for square matrices, but they can be applied to rectangular matrices by a blocking technique. We obtain an expression for the order of the operation count for this blocked multiplication of rectangular matrices. We derive an exact operation count for Strassen's method with rectangular matrices and determine the recursion threshold that minimizes the operation count. We also show that when Strassen's method is used to multiply rectangular matrices it is more efficient to use the method on the whole product than to apply the method to square submatrices. Fast multiplication methods can be exploited in calculating a QR decomposition of an $m \times n$ matrix. We show that the operation count can be reduced from $O(mn^2)$ to $O(mn^{1+1/(4-\alpha)})$ by using a fast multiplication method with exponent $\alpha$ in conjunction with Bischof and Van Loan's WY representation of a product of Householder transformations.
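    A sketch (not the paper's code) of Strassen's method with a recursion threshold: subproblems of size at most $n_0$ fall back to conventional multiplication. The paper derives the threshold that minimizes the exact operation count; the value used here is only illustrative.

```python
# Strassen's method with a recursion threshold n0. Assumes square matrices
# with n a power of two, for brevity.
import numpy as np

def strassen(A, B, n0=64):
    n = A.shape[0]
    if n <= n0:
        return A @ B                                   # conventional product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, n0)            # 7 recursive products
    M2 = strassen(A21 + A22, B11, n0)
    M3 = strassen(A11, B12 - B22, n0)
    M4 = strassen(A22, B21 - B11, n0)
    M5 = strassen(A11 + A12, B22, n0)
    M6 = strassen(A21 - A11, B11 + B12, n0)
    M7 = strassen(A12 - A22, B21 + B22, n0)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(2)
A, B = rng.standard_normal((128, 128)), rng.standard_normal((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```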

    A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

    As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
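    A minimal serial sketch (assumed, not the paper's implementation) of the tiled Cholesky factorization described above: each statement is one small kernel on a square tile (POTRF, TRSM, SYRK/GEMM), and the read/write sets of these tasks define the dependency DAG that a dynamic scheduler could execute out of order.

```python
# Tiled Cholesky written as its serial task order; each kernel touches only
# the named nb x nb tiles.
import numpy as np
from scipy.linalg import solve_triangular

def tiled_cholesky(A, nb):
    """In-place: overwrite the lower triangle of A with L, L @ L.T == A.
    Assumes nb divides A.shape[0], for brevity."""
    n = A.shape[0]
    for k in range(0, n, nb):
        kk = slice(k, k + nb)
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])            # POTRF
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            A[ii, kk] = solve_triangular(                    # TRSM
                A[kk, kk], A[ii, kk].T, lower=True).T
        for j in range(k + nb, n, nb):
            jj = slice(j, j + nb)
            for i in range(j, n, nb):
                ii = slice(i, i + nb)
                A[ii, jj] -= A[ii, kk] @ A[jj, kk].T         # SYRK / GEMM

rng = np.random.default_rng(3)
B = rng.standard_normal((8, 8))
A = B @ B.T + 8 * np.eye(8)       # SPD test matrix
C = A.copy()
tiled_cholesky(C, nb=2)
L = np.tril(C)
assert np.allclose(L @ L.T, A)
```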

    Householder orthogonalization with a non-standard inner product

    Householder orthogonalization plays an important role in numerical linear algebra. It attains perfect orthogonality regardless of the conditioning of the input. However, in the context of a non-standard inner product, it becomes difficult to apply Householder orthogonalization due to the lack of an initial orthogonal basis. We propose strategies to overcome this obstacle and discuss algorithms and variants of Householder orthogonalization with a non-standard inner product. Rounding error analysis and numerical experiments demonstrate that our approach is numerically stable.
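    One textbook route around the missing initial basis, shown as a sketch (this need not be the strategy the paper proposes): for an SPD matrix $B$ defining $\langle x, y \rangle_B = x^T B y$, factor $B = R^T R$ by Cholesky and run ordinary Householder QR on $RA$; then $W = R^{-1}Q$ is a $B$-orthonormal basis with $A = WS$.

```python
# Reduce B-orthonormalization to standard Householder QR via a Cholesky
# factor of the inner-product matrix B.
import numpy as np
from scipy.linalg import cholesky, qr, solve_triangular

rng = np.random.default_rng(4)
m, n = 7, 3
C = rng.standard_normal((m, m))
B = C @ C.T + m * np.eye(m)            # SPD inner-product matrix (assumed)
A = rng.standard_normal((m, n))

R = cholesky(B)                        # upper triangular, B = R.T @ R
Q, S = qr(R @ A, mode='economic')      # Householder QR under the hood
W = solve_triangular(R, Q)             # W = inv(R) @ Q
assert np.allclose(W.T @ B @ W, np.eye(n))   # B-orthonormal columns
assert np.allclose(W @ S, A)                 # factorization of A
```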
