randUTV: A blocked randomized algorithm for computing a rank-revealing UTV factorization
This manuscript describes the randomized algorithm randUTV for computing a
so-called UTV factorization efficiently. Given a matrix A, the algorithm
computes a factorization A = UTV*, where U and V have orthonormal
columns, and T is triangular (either upper or lower, whichever is preferred).
The algorithm randUTV is developed primarily to be a fast and easily
parallelized alternative to algorithms for computing the Singular Value
Decomposition (SVD). randUTV provides accuracy very close to that of the SVD
for problems such as low-rank approximation, solving ill-conditioned linear
systems, determining bases for various subspaces associated with the matrix,
etc. Moreover, randUTV produces highly accurate approximations to the singular
values of A. Unlike the SVD, the randomized algorithm proposed builds a UTV
factorization in an incremental, single-stage, and non-iterative way, making it
possible to halt the factorization process once a specified tolerance has been
met. Numerical experiments comparing the accuracy and speed of randUTV to the
SVD are presented. These experiments demonstrate that in comparison to column
pivoted QR, which is another factorization that is often used as a relatively
economic alternative to the SVD, randUTV compares favorably in terms of speed
while providing far higher accuracy
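
As a rough illustration of what a UTV factorization looks like, the following NumPy sketch builds A = UTV* in a single, unblocked step: V comes from the QR factorization of a randomized sample of the row space of A, and U and T come from the QR factorization of AV. It is a simplified stand-in under stated assumptions (real matrix, at least as many rows as columns), not the blocked randUTV algorithm of the manuscript. The function name `utv_sketch` and the parameters `q` and `seed` are hypothetical.

```python
import numpy as np

def utv_sketch(A, q=1, seed=0):
    """Single-step randomized UTV sketch (assumed simplification, not blocked randUTV).

    Returns U (m x n), upper triangular T (n x n), V (n x n) with A = U @ T @ V.T.
    Assumes A is real with m >= n.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sample the row space of A with a Gaussian test matrix.
    Y = A.T @ rng.standard_normal((m, n))
    # Optional power iterations sharpen the alignment with the singular vectors.
    for _ in range(q):
        Y = A.T @ (A @ Y)
    # Orthonormal basis V for the sampled row space.
    V, _ = np.linalg.qr(Y)
    # A V = U T with U orthonormal and T upper triangular, hence A = U T V^T.
    U, T = np.linalg.qr(A @ V)
    return U, T, V
```

In this sketch the absolute values of the diagonal of T serve as rough estimates of the singular values of A; how close they come depends on `q` and on the spectrum of the matrix.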
Solución de Problemas Matriciales de “Gran Escala” sobre Procesadores Multinúcleo y GPUs (Solving “Large-Scale” Matrix Problems on Multicore Processors and GPUs)
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance
when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given
the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where
data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming
high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary
eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000
symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not
necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution
to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in
support of these claims
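
To put the one-hour figure in perspective, the short calculation below estimates the sustained arithmetic rate and the storage it implies. It assumes a Cholesky factorization (about n³/3 floating-point operations) and double-precision storage of the full matrix; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
def cholesky_budget(n, seconds, bytes_per_entry=8):
    """Back-of-the-envelope cost of factoring an n x n SPD matrix.

    Assumes roughly n^3 / 3 flops for Cholesky and full (not packed) storage.
    """
    flops = n**3 / 3                              # leading-order flop count
    rate_gflops = flops / seconds / 1e9           # sustained rate needed
    storage_gb = n * n * bytes_per_entry / 1e9    # size of the full matrix
    return flops, rate_gflops, storage_gb

flops, rate, storage = cholesky_budget(100_000, 3600)
print(f"{flops:.2e} flops, {rate:.1f} GFLOP/s sustained, {storage:.0f} GB on disk")
```

Under these assumptions the matrix occupies about 80 GB, which is why it has to live on disk, while the required sustained rate is on the order of 90 GFLOP/s.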
Programming matrix algorithms-by-blocks for thread-level parallelism
With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of blocks, facilitating algorithms-by-blocks. Transparent to the library implementor, operand descriptions are registered for a particular operation a priori. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable
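
The idea of an algorithm-by-blocks can be sketched with a Cholesky factorization in which the matrix is stored as a grid of blocks and every statement in the loop body is a suboperation on whole blocks. Cholesky is chosen here only for brevity; the abstract itself discusses LU and QR. This sketch runs sequentially and does not use the FLASH or SuperMatrix interfaces; the function name and the block size `b` are hypothetical.

```python
import numpy as np

def cholesky_by_blocks(A, b):
    """Lower Cholesky factor of an SPD matrix, computed block by block.

    Each statement in the loops touches only whole b x b blocks, so a runtime
    could treat it as a task and derive dependencies from its operands.
    """
    n = A.shape[0]
    assert n % b == 0, "sketch assumes the block size divides n"
    N = n // b
    # View the matrix as an N x N grid of independent blocks.
    B = [[A[i*b:(i+1)*b, j*b:(j+1)*b].astype(float) for j in range(N)]
         for i in range(N)]
    for k in range(N):
        # Task 1: factor the diagonal block.
        B[k][k] = np.linalg.cholesky(B[k][k])
        # Task 2: triangular solves, B[i][k] := B[i][k] * inv(L_kk)^T.
        for i in range(k + 1, N):
            B[i][k] = np.linalg.solve(B[k][k], B[i][k].T).T
        # Task 3: updates of the trailing lower-triangular blocks.
        for i in range(k + 1, N):
            for j in range(k + 1, i + 1):
                B[i][j] -= B[i][k] @ B[j][k].T
    # Reassemble the lower-triangular factor from its blocks.
    L = np.zeros((n, n))
    for i in range(N):
        for j in range(i + 1):
            L[i*b:(i+1)*b, j*b:(j+1)*b] = B[i][j]
    return L
```

Within one iteration `k`, the solves in Task 2 are independent of one another, as are the updates in Task 3; this is the kind of parallelism a dependency-tracking runtime can exploit.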
College of Engineering
Cornell University Courses of Study V.80 1988/8
Block-Partitioned Algorithms for Solving the Linear Least Squares Problem
The linear least squares problem arises in many areas of science and engineering. When the coefficient matrix has full rank, the solution can be obtained quickly by using a QR factorization built on BLAS-3 operations. In contrast, when the matrix is rank-deficient, or the rank is unknown, other, slower methods must be applied: the SVD or a complete orthogonal decomposition. The SVD determines the rank more reliably but is computationally more expensive. On the other hand, the complete orthogonal decomposition is faster and works well in practice
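
The two regimes described above can be contrasted in a few lines of NumPy. The sketch covers only the QR route for full-rank matrices and the SVD route for rank-deficient ones; it is not block-partitioned and it omits the complete orthogonal decomposition. The function names and the tolerance `rtol` are hypothetical.

```python
import numpy as np

def lstsq_qr(A, b):
    """Least squares via QR; valid only when A has full column rank."""
    Q, R = np.linalg.qr(A)              # reduced QR: A = Q R
    return np.linalg.solve(R, Q.T @ b)  # solve R x = Q^T b

def lstsq_svd(A, b, rtol=1e-12):
    """Minimum-norm least squares via a truncated SVD; also reports the rank."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Numerical rank: singular values above a relative threshold.
    rank = int(np.sum(s > rtol * s[0]))
    x = Vt[:rank].T @ ((U[:, :rank].T @ b) / s[:rank])
    return x, rank
```

On a full-rank matrix both functions return the same solution; on a rank-deficient one, `lstsq_qr` breaks down because R is singular, whereas `lstsq_svd` returns the minimum-norm solution together with the detected rank.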
Using desktop computers to solve large-scale dense linear algebra problems
We provide experimental evidence that current desktop computers feature enough computational power to solve large-scale dense linear algebra problems. While the high computational cost of the numerical methods for solving these problems can be tackled by the multiple cores of current processors, we propose to use the disk to store the large data structures associated with these applications. Our results also show that the limited amount of RAM and the comparatively slow disk of the system pose no problem for the solution of very large dense linear systems and linear least-squares problems. Thus, current desktop computers are revealed as an appealing, cost-effective platform for research groups that have to deal with large dense linear algebra problems but have no direct access to large computing facilities
Exploiting the capabilities of modern GPUs for dense matrix computations
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We compare single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed
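
Mixed-precision iterative refinement can be sketched on the CPU with NumPy: the solves run in single precision, while residuals and the accumulated solution are kept in double precision. This is an illustration of the technique only; it does not use CUBLAS or a GPU, and for simplicity it calls `np.linalg.solve` at every step instead of reusing one single-precision LU factorization, as a practical implementation would. The function name and the iteration count are hypothetical.

```python
import numpy as np

def solve_mixed_precision(A, b, iters=5):
    """Solve A x = b with float32 solves and float64 residual correction."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in single precision.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        # Residual computed in double precision.
        r = b - A @ x
        # Correction computed cheaply in single precision.
        d = np.linalg.solve(A32, r.astype(np.float32))
        # Solution accumulated in double precision.
        x += d.astype(np.float64)
    return x
```

For a well-conditioned matrix, a handful of refinement steps is typically enough to reach double-precision accuracy; for ill-conditioned matrices the iteration may converge slowly or not at all.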
A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures
Out-of-core implementations of algorithms for dense matrix computations have traditionally focused on optimal use of memory so as to minimize I/O, often trading programmability for performance. In this article we
show how the current state of hardware and software allows the programmability problem to be addressed
without sacrificing performance. This comes from the realizations that memory is cheap and large, making it less necessary to optimally orchestrate I/O, and that new algorithms view matrices as collections of
submatrices and computation as operations with those submatrices. This enables libraries to be coded at a
high level of abstraction, leaving the tasks of scheduling the computations and data movement in the hands
of a runtime system. This is in sharp contrast to more traditional approaches that leverage optimal use of
in-core memory and, at the expense of introducing considerable programming complexity, explicit overlap of
I/O with computation. Performance is demonstrated for this approach on multicore architectures as well as
platforms equipped with hardware accelerators
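
The view of a matrix as a collection of submatrices held on disk can be sketched with `numpy.memmap`: only the tiles taking part in the current suboperation are copied into memory. The example is a sequential tiled matrix product, not the runtime system of the article, and it performs no scheduling or overlap of I/O with computation. The helper names, the square-matrix restriction, and the file layout are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

def open_disk_matrix(path, n, mode="r+"):
    """Map an n x n float64 matrix stored in a raw binary file (mode 'w+' creates it)."""
    return np.memmap(path, dtype=np.float64, mode=mode, shape=(n, n))

def ooc_matmul(A, B, C, tile):
    """Compute C = A @ B tile by tile for square disk-backed matrices."""
    n = A.shape[0]
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            # Accumulator for one tile of C, held in memory.
            acc = np.zeros((min(tile, n - i), min(tile, n - j)))
            for k in range(0, n, tile):
                # Explicitly copy the two input tiles from disk into memory.
                a = np.array(A[i:i + tile, k:k + tile])
                b = np.array(B[k:k + tile, j:j + tile])
                acc += a @ b
            # Write the finished tile back to the disk-backed result.
            C[i:i + tile, j:j + tile] = acc
    C.flush()

# Usage: three small matrices in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    n = 8
    A = open_disk_matrix(os.path.join(d, "A.bin"), n, mode="w+")
    B = open_disk_matrix(os.path.join(d, "B.bin"), n, mode="w+")
    C = open_disk_matrix(os.path.join(d, "C.bin"), n, mode="w+")
    A[:] = np.arange(n * n).reshape(n, n)
    B[:] = np.eye(n)
    ooc_matmul(A, B, C, tile=3)
    result_matches = bool(np.allclose(np.array(C), np.array(A)))
    del A, B, C  # release the mappings before the directory is removed
```

Each innermost iteration is a self-contained task on three tiles, which is the granularity at which a runtime could schedule both the computation and the data movement.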