1,517 research outputs found
A Householder-based algorithm for Hessenberg-triangular reduction
The QZ algorithm for computing eigenvalues and eigenvectors of a matrix
pencil requires that the matrices first be reduced to
Hessenberg-triangular (HT) form. The current method of choice for HT reduction
relies entirely on Givens rotations regrouped and accumulated into small dense
matrices which are subsequently applied using matrix multiplication routines. A
non-vanishing fraction of the total flop count must nevertheless still be
performed as sequences of overlapping Givens rotations alternately applied from
the left and from the right. The many data dependencies associated with this
computational pattern lead to inefficient use of the processor and poor
scalability.
In this paper, we therefore introduce a fundamentally different approach that
relies entirely on (large) Householder reflectors partially accumulated into
block reflectors, by using (compact) WY representations. Even though the new
algorithm requires more floating-point operations than the state-of-the-art
algorithm, extensive experiments on both real and synthetic data indicate that
it is still competitive, even in a sequential setting. The new algorithm is
conjectured to have better parallel scalability, an idea which is partially
supported by early small-scale experiments using multi-threaded BLAS. The
design and evaluation of a parallel formulation is future work.
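The compact WY representation mentioned in this abstract can be illustrated on a simpler problem. The following is a minimal sketch, assuming NumPy and a plain Householder QR factorization rather than the Hessenberg-triangular reduction itself; the function name `compact_wy_qr` and its return values are hypothetical and not taken from the paper. It accumulates the reflectors H_1 ... H_n into a single block reflector Q = I - V T V^T with T upper triangular.

```python
import numpy as np


def compact_wy_qr(A):
    """Householder QR of an m x n matrix (m >= n), accumulating the
    reflectors in compact WY form: Q = H_1 ... H_n = I - V T V^T."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    V = np.zeros((m, n))   # Householder vectors, one per column
    T = np.zeros((n, n))   # upper triangular factor of the block reflector
    for j in range(n):
        x = R[j:, j]
        v = np.zeros(m)
        v[j:] = x
        # Choose the sign that avoids cancellation in v[j].
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        v[j] -= alpha
        vv = v @ v
        tau = 0.0 if vv == 0.0 else 2.0 / vv
        # Apply H_j = I - tau v v^T to R from the left.
        R -= tau * np.outer(v, v @ R)
        # Extend T so that I - V T V^T equals H_1 ... H_j.
        T[:j, j] = -tau * (T[:j, :j] @ (V[:, :j].T @ v))
        T[j, j] = tau
        V[:, j] = v
    Q = np.eye(m) - V @ T @ V.T
    return Q, R, V, T
```

In a blocked algorithm, Q would normally not be formed explicitly; instead, V and T would be applied to a trailing matrix with a few matrix-matrix products, which is where the efficiency of block reflectors comes from.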
randUTV: A blocked randomized algorithm for computing a rank-revealing UTV factorization
This manuscript describes the randomized algorithm randUTV for computing a
so-called UTV factorization efficiently. Given a matrix $A$, the algorithm
computes a factorization $A = U T V^{*}$, where $U$ and $V$ have orthonormal
columns, and $T$ is triangular (either upper or lower, whichever is preferred).
The algorithm randUTV is developed primarily to be a fast and easily
parallelized alternative to algorithms for computing the Singular Value
Decomposition (SVD). randUTV provides accuracy very close to that of the SVD
for problems such as low-rank approximation, solving ill-conditioned linear
systems, determining bases for various subspaces associated with the matrix,
etc. Moreover, randUTV produces highly accurate approximations to the singular
values of $A$. Unlike the SVD, the randomized algorithm proposed builds a UTV
factorization in an incremental, single-stage, and non-iterative way, making it
possible to halt the factorization process once a specified tolerance has been
met. Numerical experiments comparing the accuracy and speed of randUTV to the
SVD are presented. These experiments demonstrate that, in comparison to
column-pivoted QR, another factorization that is often used as a relatively
economical alternative to the SVD, randUTV compares favorably in terms of speed
while providing far higher accuracy.
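The incremental, blocked structure described in this abstract can be sketched as follows. This is a rough illustration only, assuming NumPy and real-valued matrices, and is not the authors' implementation: the function names `step_utv` and `rand_utv`, the block size `b`, and the power-iteration count `q` are assumptions introduced here. Each step uses a Gaussian random sketch to choose a right orthogonal transformation, then a small SVD to diagonalize the leading block, so that an upper triangular factor is built block column by block column.

```python
import numpy as np


def step_utv(A, b, q, rng):
    """One step: return U, T, V with A = U T V^T and the first b columns
    of T diagonal. Assumes min(A.shape) > b."""
    m, n = A.shape
    G = rng.standard_normal((m, b))
    Y = A.T @ G
    for _ in range(q):                       # optional power iterations
        Y = A.T @ (A @ Y)
    V, _ = np.linalg.qr(Y, mode="complete")  # n x n orthogonal
    U, s, Wt = np.linalg.svd(A @ V[:, :b], full_matrices=True)
    D = np.zeros((m, b))
    D[:b, :b] = np.diag(s)
    T = np.hstack([D, U.T @ A @ V[:, b:]])
    V[:, :b] = V[:, :b] @ Wt.T
    return U, T, V


def rand_utv(A, b=2, q=1, seed=0):
    """Blocked randomized UTV sketch: A = U T V^T with T upper triangular."""
    rng = np.random.default_rng(seed)
    T = np.array(A, dtype=float)
    m, n = T.shape
    U, V = np.eye(m), np.eye(n)
    for k in range(0, min(m, n), b):
        rows, cols = slice(k, m), slice(k, n)
        block = T[rows, cols]
        if min(block.shape) > b:
            Uh, Th, Vh = step_utv(block, b, q, rng)
        else:
            # The last block is small enough for a direct SVD.
            Uh, s, Vht = np.linalg.svd(block, full_matrices=True)
            Th = np.zeros(block.shape)
            Th[:len(s), :len(s)] = np.diag(s)
            Vh = Vht.T
        U[:, rows] = U[:, rows] @ Uh
        V[:, cols] = V[:, cols] @ Vh
        T[:k, cols] = T[:k, cols] @ Vh       # update rows already finished
        T[rows, cols] = Th
    return U, T, V
```

Because each pass through the loop finalizes `b` more columns of `T`, a tolerance check on the remaining block could be added inside the loop to stop early; that check is omitted here to keep the sketch short.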
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform.
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than in other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to combine a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).
Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA).
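The idea of trading flops for communication in a tall-and-skinny QR factorization can be sketched in a few lines. The following is a minimal single-process illustration, assuming NumPy; the function name `tsqr_r` and the `num_blocks` parameter are hypothetical, and the row blocks merely stand in for data held at different sites. Each block is factored locally, and only the small triangular factors are combined in a second QR step, so the amount of data exchanged would not depend on the number of rows.

```python
import numpy as np


def tsqr_r(A, num_blocks):
    """Return the R factor of a tall-and-skinny matrix A using one
    reduction level: local QR per row block, then QR of the stacked Rs."""
    blocks = np.array_split(A, num_blocks, axis=0)
    # Local step: each block only produces a small triangular factor.
    local_rs = [np.linalg.qr(B, mode="r") for B in blocks]
    # Reduction step: only the stacked small factors are combined.
    return np.linalg.qr(np.vstack(local_rs), mode="r")
```

The result agrees with the R factor of a direct QR factorization up to the signs of its rows. A real distributed version would use a reduction tree matched to the network topology, which this sketch does not attempt to model.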