A Householder-based algorithm for Hessenberg-triangular reduction
The QZ algorithm for computing eigenvalues and eigenvectors of a matrix
pencil requires that the matrices first be reduced to
Hessenberg-triangular (HT) form. The current method of choice for HT reduction
relies entirely on Givens rotations that are regrouped and accumulated into
small dense matrices, which are subsequently applied using matrix
multiplication routines. A
non-vanishing fraction of the total flop count must nevertheless still be
performed as sequences of overlapping Givens rotations alternately applied from
the left and from the right. The many data dependencies associated with this
computational pattern lead to inefficient use of the processor and poor
scalability.
In this paper, we therefore introduce a fundamentally different approach that
relies entirely on (large) Householder reflectors partially accumulated into
block reflectors, by using (compact) WY representations. Even though the new
algorithm requires more floating point operations than the state-of-the-art
algorithm, extensive experiments on both real and synthetic data indicate that
it is still competitive, even in a sequential setting. The new algorithm is
conjectured to have better parallel scalability, an idea which is partially
supported by early small-scale experiments using multi-threaded BLAS. The
design and evaluation of a parallel formulation is future work.
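As background for the compact WY idea this abstract relies on, the sketch
below (plain NumPy, written for this summary rather than taken from the
paper) accumulates a few Householder reflectors into the Schreiber-Van Loan
form $I - VTV^T$, so that applying them reduces to matrix multiplication:

```python
import numpy as np

def house(x):
    """Householder vector v and scalar tau with
    (I - tau*v*v^T) x = -sign(x[0]) * ||x|| * e_1."""
    v = x.astype(float)
    s = np.sign(x[0]) if x[0] != 0 else 1.0
    v[0] += s * np.linalg.norm(x)
    return v, 2.0 / (v @ v)

def compact_wy(vs, taus):
    """Accumulate H_1 * ... * H_k, with H_j = I - tau_j v_j v_j^T, into
    the compact WY form I - V T V^T (T upper triangular)."""
    n, k = len(vs[0]), len(vs)
    V, T = np.zeros((n, k)), np.zeros((k, k))
    for j, (v, tau) in enumerate(zip(vs, taus)):
        V[:, j] = v
        T[:j, j] = -tau * (T[:j, :j] @ (V[:, :j].T @ v))
        T[j, j] = tau
    return V, T

# Demo: triangularize A column by column, then rebuild Q in block form.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
R, vs, taus = A.copy(), [], []
for j in range(3):
    v = np.zeros(6)
    v[j:], tau = house(R[j:, j])
    R -= tau * np.outer(v, v @ R)     # apply H_j from the left
    vs.append(v); taus.append(tau)
V, T = compact_wy(vs, taus)
Q = np.eye(6) - V @ T @ V.T           # Q = H_1 H_2 H_3
assert np.allclose(Q @ R, A)
```

The payoff is that the block update A <- A - V(T(V^T A)) is three matrix
multiplications, which map well onto multithreaded BLAS; that is the
property the paper exploits with large reflectors.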
Banded Householder representation of linear subspaces
We show how to compactly represent any $k$-dimensional subspace of
$\mathbb{R}^n$ as a banded product of Householder reflections using $k(n-k)$
floating point numbers. This is optimal since these subspaces form a
Grassmannian of dimension $k(n-k)$. The representation is stable and easy to
compute: any matrix can be factored into the product of a banded Householder
matrix and a square matrix using two to three QR decompositions.
Comment: 5 pages, 1 figure, submitted to Linear Algebra and its Applications
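To make the object being compressed concrete, here is the standard dense
(non-banded) Householder encoding of a subspace that the paper's banded
construction refines; this is an illustrative sketch of the classical
encoding, not the paper's algorithm:

```python
import numpy as np

def householder_vectors(X):
    """Encode span(X), X of shape (n, k) with full column rank, by k
    Householder vectors (dense n*k storage; the paper's banded variant
    compresses this further)."""
    n, k = X.shape
    V, R = np.zeros((n, k)), X.astype(float).copy()
    for j in range(k):
        x = R[j:, j]
        v = x.copy()
        v[0] += (np.sign(x[0]) or 1.0) * np.linalg.norm(x)
        v /= np.linalg.norm(v)
        V[j:, j] = v
        R[j:, j:] -= 2.0 * np.outer(v, v @ R[j:, j:])
    return V

def subspace_basis(V):
    """Recover an orthonormal basis of the subspace by applying
    H_1 ... H_k to the first k columns of the identity."""
    n, k = V.shape
    Q = np.eye(n)[:, :k]
    for j in reversed(range(k)):
        v = V[:, j]
        Q -= 2.0 * np.outer(v, v @ Q)
    return Q

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 3))
Q = subspace_basis(householder_vectors(X))
P = X @ np.linalg.solve(X.T @ X, X.T)   # projector onto span(X)
assert np.allclose(Q.T @ Q, np.eye(3)) and np.allclose(Q @ Q.T, P)
```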
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense $n \times n$ matrix multiplication using the
conventional algorithm, where the input matrices were too large to fit in the
small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as $\Omega(\#\text{arithmetic operations}/\sqrt{M})$, where $M$ is
the size of the fast memory (or local memory in the parallel case). Here we
generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
Comment: 27 pages, 2 tables
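As a quick numerical illustration of the bound (with made-up machine
parameters), the bandwidth and latency lower bounds for conventional dense
matrix multiplication can be evaluated directly:

```python
# Hypothetical parameters, chosen only for illustration.
n = 10_000                       # matrix dimension
M = 8 * 1024 ** 2                # fast-memory size in words
flops = 2 * n ** 3               # conventional n x n matrix multiplication
words_lb = flops / M ** 0.5      # bandwidth: Omega(#flops / sqrt(M))
messages_lb = flops / M ** 1.5   # latency: Omega(#flops / M^(3/2)), since
                                 # each message carries at most M words
print(f"words moved   >= {words_lb:.2e}")
print(f"messages sent >= {messages_lb:.2e}")
```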
A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix
computations. We present a parallel algorithm for computing QR decompositions
whose bandwidth cost (communication volume) can be decreased at the cost of
increasing its latency cost (number of messages). By varying a parameter to
navigate the bandwidth/latency tradeoff, we can tune this algorithm for
machines with different communication costs.
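The tuning knob the abstract describes can be pictured with a toy alpha-beta
cost model. The cost shapes below are stand-ins assumed only to exhibit the
stated tradeoff (bandwidth falling and latency rising in a replication
parameter c); they are not the paper's actual analysis:

```python
# Toy model: cost = alpha * messages + beta * words moved.
def cost(c, p=1024, n=10_000, alpha=1e-6, beta=1e-9):
    words = n * n / (c * p) ** 0.5   # bandwidth shrinks with c (assumed form)
    msgs = (c ** 3 * p) ** 0.5       # latency grows with c (assumed form)
    return alpha * msgs + beta * words

best_c = min(range(1, 65), key=cost)  # tune the parameter for this machine
print(best_c, cost(best_c))
```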
Fast rectangular matrix multiplication and QR decomposition
In the last twenty-five years there has been much research into "fast" matrix
multiplication methods: ones that have an asymptotically smaller operation
count than conventional multiplication. Most fast methods are derived for
square matrices, but they can be applied to rectangular matrices by a blocking
technique. We obtain an expression for the order of the operation count for
this blocked multiplication of rectangular matrices. We derive an exact
operation count for Strassen's method with rectangular matrices and determine
the recursion threshold that minimizes the operation count. We also show that
when Strassen's method is used to multiply rectangular matrices it is more
efficient to use the method on the whole product than to apply the method to
square submatrices. Fast multiplication methods can be exploited in
calculating a QR decomposition of an $m \times n$ matrix. We show that the
operation count can be reduced from $O(mn^2)$ to $O(mn^{1+1/(4-\alpha)})$ by
using a fast multiplication method with exponent $\alpha$ in conjunction with
Bischof and Van Loan's WY representation of a product of Householder
transformations.
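A minimal sketch of the recursion-threshold idea analyzed above: Strassen's
seven-multiplication recursion falling back to conventional multiplication
below a cutoff. Square, power-of-two sizes only here; the paper treats the
rectangular/blocked case and derives the optimal threshold:

```python
import numpy as np

def strassen(A, B, threshold=64):
    """Strassen multiplication with a recursion threshold: below the
    cutoff, use conventional multiplication. Assumes square matrices of
    power-of-two dimension, for brevity."""
    n = A.shape[0]
    if n <= threshold:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(strassen(A, B), A @ B)
```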
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU, and QR factorizations in which the operations can be represented
as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an
out-of-order execution of the tasks, which completely hides the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms, where parallelism can only be
exploited at the level of the BLAS operations, and with vendor implementations.
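To make the "sequence of small tasks on square blocks" concrete, here is a
sequential sketch of a tiled Cholesky factorization, written for this summary.
The paper's contribution is to run these POTRF/TRSM/SYRK-style tasks out of
order under a dynamic scheduler rather than in this program order:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky on nb x nb blocks; each loop body is
    one small task that a runtime could schedule from its dependencies."""
    A, n = A.copy(), A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # POTRF task: factor the diagonal tile.
        A[k:ke, k:ke] = cholesky(A[k:ke, k:ke], lower=True)
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            # TRSM task: A_ik <- A_ik * L_kk^{-T}.
            A[i:ie, k:ke] = solve_triangular(
                A[k:ke, k:ke], A[i:ie, k:ke].T, lower=True).T
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            for i in range(j, n, nb):
                ie = min(i + nb, n)
                # SYRK (i == j) / GEMM task: trailing update.
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 8))
A = X @ X.T + 8 * np.eye(8)          # symmetric positive definite
L = tiled_cholesky(A, nb=3)
assert np.allclose(L @ L.T, A)
```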
Householder orthogonalization with a non-standard inner product
Householder orthogonalization plays an important role in numerical linear
algebra. It attains perfect orthogonality regardless of the conditioning of the
input. However, in the context of a non-standard inner product, it becomes
difficult to apply Householder orthogonalization due to the lack of an initial
orthogonal basis. We propose strategies to overcome this obstacle and discuss
algorithms and variants of Householder orthogonalization with a non-standard
inner product. Rounding error analysis and numerical experiments demonstrate
that our approach is numerically stable.
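One classical workaround for the missing initial orthogonal basis, sketched
below for concreteness, reduces the B-inner-product problem to a standard
Householder QR via a Cholesky factor of B. This is a baseline illustration,
not necessarily the algorithm the paper proposes; forming the factor
explicitly is one reason specialized variants are of interest:

```python
import numpy as np
from scipy.linalg import cholesky, qr, solve_triangular

def b_orthonormalize(X, B):
    """Orthonormalize the columns of X in <x, y> = x^T B y, with B
    symmetric positive definite: factor B = R^T R, run Householder QR
    on R @ X in the standard 2-norm, and map the basis back."""
    R = cholesky(B)                      # upper triangular, B = R^T R
    Q1, S = qr(R @ X, mode='economic')   # Householder QR in the 2-norm
    Q = solve_triangular(R, Q1)          # Q = R^{-1} Q1, so Q^T B Q = I
    return Q, S                          # X = Q S

rng = np.random.default_rng(0)
n, k = 8, 3
F = rng.standard_normal((n, n))
B = F @ F.T + n * np.eye(n)              # symmetric positive definite
X = rng.standard_normal((n, k))
Q, S = b_orthonormalize(X, B)
assert np.allclose(Q.T @ B @ Q, np.eye(k)) and np.allclose(Q @ S, X)
```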