329 research outputs found
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm using a distribution of the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12 core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general purpose GPU processors, where our implementation obtains up to 90 GFlops on a NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).Tom ́as and Bai were supported in part by the U.S. DOES ciDAC grant DOE-DE-FC0206ER25793 and NSF grant PHY1005502. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231.Tomás Domínguez, AE.; Bai, Z.; Hernández García, V. (2013). Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors. En High Performance Computing for Computational Science - VECPAR 2012. Springer Verlag (Germany): Series. 50-58. https://doi.org/10.1007/978-3-642-38718-0_8S5058Bischof, C.H.: A parallel QR factorization algorithm with controlled local pivoting. SIAM J. Sci. Stat. Comput. 12, 36–57 (1991)Chandrasekaran, S., Ipsen, I.C.F.: On rank-revealing factorisations. SIAM J. Matrix Anal. Appl. 15, 592–622 (1994)Castaldo, A.M., Whaley, R.C.: Scaling LAPACK panel operations using parallel cache assignment. In: 15th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 223–231 (2010)Drmač, Z., Bujanović, Z.: On the failure of rank-revealing QR factorization software – a case study. ACM Trans. Math. Softw. 35, 12:1–12:28 (2008)Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm I. SIAM J. Matrix Anal. Appl. 29, 1322–1342 (2008)Drmač, Z., Veselić, K.: New fast and accurate Jacobi SVD algorithm II. SIAM J. Matrix Anal. Appl. 29, 1343–1362 (2008)Golub, G.H.: Numerical methods for solving linear least squares problems. Numer. Math. 7, 206–216 (1965)Gu, M., Eisenstat, S.: Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput. 17, 848–869 (1996)Quintana-Orti, G., Sun, X., Bischof, C.H.: A BLAS-3 version of the QR factorization with column pivoting. SIAM J. Sci. Comput. 19, 1486–1494 (1998)Schreiber, R., van Loan, C.: A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 53–57 (1989
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU and QR factorization where the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an out of
order execution of the tasks which will completely hide the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms where parallelism can only be
exploited at the level of the BLAS operations and vendor implementations
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that the usage of this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a 48
core AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses a
fully static scheduling or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in well
known libraries. On the 48 core AMD NUMA machine, our best implementation is up
to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to
82% faster than MKL. Our approach also shows significant speedups compared with
PLASMA on both of these systems
Fast Parallel Randomized QR with Column Pivoting Algorithms for Reliable Low-rank Matrix Approximations
Factorizing large matrices by QR with column pivoting (QRCP) is substantially
more expensive than QR without pivoting, owing to communication costs required
for pivoting decisions. In contrast, randomized QRCP (RQRCP) algorithms have
proven themselves empirically to be highly competitive with high-performance
implementations of QR in processing time, on uniprocessor and shared memory
machines, and as reliable as QRCP in pivot quality.
We show that RQRCP algorithms can be as reliable as QRCP with failure
probabilities exponentially decaying in oversampling size. We also analyze
efficiency differences among different RQRCP algorithms. More importantly, we
develop distributed memory implementations of RQRCP that are significantly
better than QRCP implementations in ScaLAPACK.
As a further development, we introduce the concept of and develop algorithms
for computing spectrum-revealing QR factorizations for low-rank matrix
approximations, and demonstrate their effectiveness against leading low-rank
approximation methods in both theoretical and numerical reliability and
efficiency.Comment: 11 pages, 14 figures, accepted by 2017 IEEE 24th International
Conference on High Performance Computing (HiPC), awarded the best paper priz
A fast solver for linear systems with displacement structure
We describe a fast solver for linear systems with reconstructable Cauchy-like
structure, which requires O(rn^2) floating point operations and O(rn) memory
locations, where n is the size of the matrix and r its displacement rank. The
solver is based on the application of the generalized Schur algorithm to a
suitable augmented matrix, under some assumptions on the knots of the
Cauchy-like matrix. It includes various pivoting strategies, already discussed
in the literature, and a new algorithm, which only requires reconstructability.
We have developed a software package, written in Matlab and C-MEX, which
provides a robust implementation of the above method. Our package also includes
solvers for Toeplitz(+Hankel)-like and Vandermonde-like linear systems, as
these structures can be reduced to Cauchy-like by fast and stable transforms.
Numerical experiments demonstrate the effectiveness of the software.Comment: 27 pages, 6 figure
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups up to 7 fold for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed memory component for dense rank-structured matrices
randUTV: A blocked randomized algorithm for computing a rank-revealing UTV factorization
This manuscript describes the randomized algorithm randUTV for computing a so
called UTV factorization efficiently. Given a matrix , the algorithm
computes a factorization , where and have orthonormal
columns, and is triangular (either upper or lower, whichever is preferred).
The algorithm randUTV is developed primarily to be a fast and easily
parallelized alternative to algorithms for computing the Singular Value
Decomposition (SVD). randUTV provides accuracy very close to that of the SVD
for problems such as low-rank approximation, solving ill-conditioned linear
systems, determining bases for various subspaces associated with the matrix,
etc. Moreover, randUTV produces highly accurate approximations to the singular
values of . Unlike the SVD, the randomized algorithm proposed builds a UTV
factorization in an incremental, single-stage, and non-iterative way, making it
possible to halt the factorization process once a specified tolerance has been
met. Numerical experiments comparing the accuracy and speed of randUTV to the
SVD are presented. These experiments demonstrate that in comparison to column
pivoted QR, which is another factorization that is often used as a relatively
economic alternative to the SVD, randUTV compares favorably in terms of speed
while providing far higher accuracy
Three-Level Parallel J-Jacobi Algorithms for Hermitian Matrices
The paper describes several efficient parallel implementations of the
one-sided hyperbolic Jacobi-type algorithm for computing eigenvalues and
eigenvectors of Hermitian matrices. By appropriate blocking of the algorithms
an almost ideal load balancing between all available processors/cores is
obtained. A similar blocking technique can be used to exploit local cache
memory of each processor to further speed up the process. Due to diversity of
modern computer architectures, each of the algorithms described here may be the
method of choice for a particular hardware and a given matrix size. All
proposed block algorithms compute the eigenvalues with relative accuracy
similar to the original non-blocked Jacobi algorithm.Comment: Submitted for publicatio
- …