
    LU factorization with panel rank revealing pivoting and its communication avoiding version

    We present the LU decomposition with panel rank revealing pivoting (LU_PRRP), an LU factorization algorithm based on strong rank revealing QR panel factorization. LU_PRRP is more stable than Gaussian elimination with partial pivoting (GEPP). Our extensive numerical experiments show that the new factorization scheme is as numerically stable as GEPP in practice, but more resistant to pathological cases: it easily solves the Wilkinson matrix and the Foster matrix. We also present CALU_PRRP, a communication avoiding version of LU_PRRP based on tournament pivoting, in which the pivots at each step of the tournament are selected via strong rank revealing QR factorization. CALU_PRRP is more stable than CALU, the communication avoiding version of GEPP, and in practice it is resistant to pathological cases on which GEPP and CALU fail. Comment: No. RR-7867 (2012)
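
    For intuition, here is a minimal Python sketch of the tournament pivoting step, using SciPy's QR with column pivoting as a stand-in for the strong rank revealing QR factorization the paper actually requires; function names and block sizes are illustrative, not the authors' implementation.

        import numpy as np
        from scipy.linalg import qr

        def select_pivot_rows(block, b):
            # QRCP on the transpose ranks the *rows* of the block;
            # keep the b best ones (stand-in for strong RRQR).
            _, _, piv = qr(block.T, pivoting=True, mode='economic')
            return piv[:b]

        def tournament_pivot_rows(panel, b):
            # Round 1: pick b candidate rows from each block of rows.
            n = panel.shape[0]
            blocks = np.array_split(np.arange(n), max(1, n // (2 * b)))
            candidates = [rows[select_pivot_rows(panel[rows], b)] for rows in blocks]
            # Reduction tree: merge candidate sets pairwise until b rows remain.
            while len(candidates) > 1:
                merged = []
                for i in range(0, len(candidates), 2):
                    rows = np.concatenate(candidates[i:i + 2])
                    merged.append(rows[select_pivot_rows(panel[rows], b)])
                candidates = merged
            return candidates[0]

    The winning b rows then serve as the panel's pivot rows in the block elimination step, which is what bounds the growth factor.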

    Hybrid static/dynamic scheduling for already optimized dense matrix factorization

    We present a hybrid static/dynamic scheduling strategy for the task dependency graph of direct methods in dense numerical linear algebra. This strategy balances data locality, load balance, and low dequeue overhead. We show that using this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than CALU with either fully static or fully dynamic scheduling. Our algorithm also yields speedups over the corresponding LU factorization routines in well-known libraries: on the 48-core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16-core Intel Xeon machine it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both systems.
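
    The scheduling idea itself is independent of the factorization. Below is an illustrative Python sketch (names hypothetical, not the authors' code): each worker first executes its statically assigned task list, preserving locality with no dequeue overhead, and only then pulls leftover tasks from a shared queue for load balancing.

        import queue
        import threading

        def run_hybrid(static_parts, dynamic_tasks, n_workers):
            shared = queue.Queue()
            for t in dynamic_tasks:
                shared.put(t)

            def worker(my_tasks):
                for t in my_tasks:          # static phase: locality, no dequeue cost
                    t()
                while True:                 # dynamic phase: balance leftover work
                    try:
                        shared.get_nowait()()
                    except queue.Empty:
                        return

            threads = [threading.Thread(target=worker, args=(static_parts[i],))
                       for i in range(n_workers)]
            for th in threads:
                th.start()
            for th in threads:
                th.join()

    The static/dynamic split ratio is the tuning knob: 100% static recovers the locality-friendly schedule, 100% dynamic recovers the fully load-balanced one.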

    Fast Parallel Randomized QR with Column Pivoting Algorithms for Reliable Low-rank Matrix Approximations

    Factorizing large matrices by QR with column pivoting (QRCP) is substantially more expensive than QR without pivoting, owing to the communication costs of pivoting decisions. In contrast, randomized QRCP (RQRCP) algorithms have proven empirically to be highly competitive with high-performance implementations of QR in processing time, on uniprocessor and shared memory machines, and as reliable as QRCP in pivot quality. We show that RQRCP algorithms can be as reliable as QRCP, with failure probabilities decaying exponentially in the oversampling size. We also analyze efficiency differences among RQRCP variants. More importantly, we develop distributed memory implementations of RQRCP that are significantly better than the QRCP implementations in ScaLAPACK. As a further development, we introduce the concept of, and develop algorithms for computing, spectrum-revealing QR factorizations for low-rank matrix approximations, and demonstrate their effectiveness against leading low-rank approximation methods in both theoretical and numerical reliability and efficiency. Comment: 11 pages, 14 figures, accepted by the 2017 IEEE 24th International Conference on High Performance Computing (HiPC), awarded the best paper prize.
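
    The core RQRCP idea fits in a few lines. This is a hedged single-pass sketch: the paper's blocked algorithms refresh the sketch every b columns, whereas this version selects all k pivots from one Gaussian sketch and then factors the chosen columns at full precision.

        import numpy as np
        from scipy.linalg import qr

        def rqrcp(A, k, oversample=8, seed=0):
            m, n = A.shape
            rng = np.random.default_rng(seed)
            Omega = rng.standard_normal((k + oversample, m))
            B = Omega @ A                        # small (k+p) x n sketch: cheap to pivot
            _, _, piv = qr(B, pivoting=True, mode='economic')
            Q, R = qr(A[:, piv[:k]], mode='economic')   # full-precision factorization
            return Q, R, piv[:k]

    Because pivoting decisions are made on the (k+p) x n sketch rather than on the m x n matrix, the communication-heavy column-norm updates of classical QRCP disappear, which is where the speedup comes from.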

    Minimizing Communication for Eigenproblems and the Singular Value Decomposition

    Algorithms have two costs: arithmetic and communication. The latter is the cost of moving data, either between levels of a memory hierarchy or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In [BDHS10], lower bounds were presented on the amount of communication required for essentially all O(n^3)-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that attain these lower bounds, and analyze their convergence and communication costs. Comment: 43 pages, 11 figures.
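
    For reference, the lower bounds from [BDHS10] that these algorithms attain have the following form for a machine with fast memory of size M executing G = Θ(n^3) flops (the parallel case replaces M by the local memory size per processor):

        \#\text{words moved} = \Omega\!\left(\frac{G}{\sqrt{M}}\right) = \Omega\!\left(\frac{n^3}{\sqrt{M}}\right),
        \qquad
        \#\text{messages} = \Omega\!\left(\frac{G}{M^{3/2}}\right) = \Omega\!\left(\frac{n^3}{M^{3/2}}\right).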

    Reducing Communication in the Solution of Linear Systems

    There is a growing performance gap between computation and communication on modern computers, making it crucial to develop algorithms with lower latency and bandwidth requirements. Because systems of linear equations are important for numerous scientific and engineering applications, I have studied several approaches for reducing communication in those problems. First, I developed optimizations to dense LU with partial pivoting, which downstream applications can adopt with little to no effort. Second, I consider two techniques that completely replace pivoting in dense LU, which can provide significantly higher speedups, albeit without the numerical guarantees of partial pivoting. One technique uses randomized preprocessing; the other is a novel combination of block factorization and additive perturbation. Finally, I investigate using mixed precision in GMRES for solving sparse systems, which reduces the volume of data movement and thus the pressure on memory bandwidth.
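
    As a concrete illustration of the mixed precision direction, here is a hedged sketch (an assumed scheme in the spirit of iterative refinement, not necessarily the thesis's exact algorithm): the inner GMRES solve runs entirely in float32, halving the memory traffic for the matrix and the Krylov basis, while the residual and solution are maintained in float64.

        import numpy as np
        from scipy.sparse.linalg import gmres

        def mixed_precision_gmres(A, b, tol=1e-10, max_outer=20):
            A32 = A.astype(np.float32)            # low-precision copy: half the traffic
            x = np.zeros(b.shape, dtype=np.float64)
            for _ in range(max_outer):
                r = b - A @ x                     # residual in full precision
                if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                    break
                # Inner solve in float32 (rtol keyword per SciPy >= 1.12).
                d, _ = gmres(A32, r.astype(np.float32), rtol=1e-4, restart=30)
                x += d.astype(np.float64)         # promote the correction
            return x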

    Block Discrete Empirical Interpolation Methods

    We present two block variants of the discrete empirical interpolation method (DEIM); as a particular application, we consider a CUR factorization. The block DEIM algorithms are based on the rank-revealing QR factorization and the concept of the maximum volume of submatrices. We also present a version of the block DEIM procedures that allows an adaptive choice of block size. Experiments demonstrate that the block DEIM algorithms may provide a better low-rank approximation, and may also be computationally more efficient, than the standard DEIM procedure.
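
    For context, the block methods generalize the standard DEIM index selection, which greedily picks one interpolation row per basis vector at the maximum of the current interpolation residual. A minimal implementation of that baseline:

        import numpy as np

        def deim_indices(U):
            # U: n x k orthonormal basis; returns k interpolation row indices.
            n, k = U.shape
            p = [int(np.argmax(np.abs(U[:, 0])))]
            for j in range(1, k):
                # Interpolate the j-th basis vector on the rows chosen so far...
                c = np.linalg.solve(U[np.ix_(p, np.arange(j))], U[p, j])
                r = U[:, j] - U[:, :j] @ c
                # ...and add the row where the interpolation error is largest.
                p.append(int(np.argmax(np.abs(r))))
            return np.array(p)

    A block variant would select several indices per step (e.g. via QRCP or a maximum-volume criterion on the residual block), which is what the paper develops.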

    Low-rank approximation for sparse matrices

    In this paper we present an algorithm for computing a low-rank approximation of a sparse matrix based on a truncated LU factorization with column and row permutations. We present various approaches for determining the column and row permutations that trade off speed against deterministic or probabilistic accuracy. We show that if the permutations are chosen using tournament pivoting based on QR factorization, then the resulting truncated LU factorization with column/row tournament pivoting, LU_CRTP, satisfies bounds on the singular values similar to those obtained by a communication avoiding rank revealing QR factorization. Experiments on challenging matrices show that LU_CRTP provides a good low-rank approximation of the input matrix and is less expensive than the rank revealing QR factorization in terms of computational cost and memory usage, while also minimizing the communication cost. We also compare the computational complexity of our algorithm with randomized algorithms and show that for sparse matrices and high enough, but still modest, accuracies, our approach is faster.
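
    A hedged sketch of the overall shape of the method (illustrative helper names, standard QRCP standing in for the paper's rank revealing machinery): a one-level tournament over column blocks selects k columns, k rows are selected from those columns, and the approximation is the resulting cross approximation.

        import numpy as np
        from scipy.linalg import qr

        def tournament_columns(A, k, blocks=8):
            winners = []
            for cols in np.array_split(np.arange(A.shape[1]), blocks):
                _, _, piv = qr(A[:, cols], pivoting=True, mode='economic')
                winners.extend(cols[piv[:k]])        # k candidates per block
            winners = np.array(winners)
            _, _, piv = qr(A[:, winners], pivoting=True, mode='economic')
            return winners[piv[:k]]                  # final round of the tournament

        def low_rank_cross(A, k):
            J = tournament_columns(A, k)             # k columns of A
            _, _, piv = qr(A[:, J].T, pivoting=True, mode='economic')
            I = piv[:k]                              # k rows, via QRCP on the transpose
            # CUR-style approximation: A ~= A[:, J] @ inv(A[I, J]) @ A[I, :]
            return A[:, J] @ np.linalg.solve(A[np.ix_(I, J)], A[I, :])

    Each QRCP here touches only one block of columns at a time, which is what makes the blockwise, tournament-style organization communication friendly.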

    Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

    We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. In contrast to the classical cubic algorithms of parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, first for matrix multiplication, then for triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants, either iterative or recursive, and different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive-iterative methods for triangular system solving, and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms, and that exact dense linear algebra matches the performance of full-rank reference numerical software even in the presence of rank deficiencies.
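
    The coarse grain recursion with delayed modular reduction can be illustrated in a few lines (a sketch, not the paper's implementation): recurse on 2x2 blocks and reduce mod p only once per block per level, so most of the time is spent in fast unreduced integer matrix multiplication. A Strassen step would replace the eight recursive products below with seven, giving the sub-cubic complexity.

        import numpy as np

        def matmul_mod_p(A, B, p, cutoff=64):
            # Assumes square matrices of power-of-two size, int64 entries in [0, p),
            # and cutoff * (p - 1)**2 small enough to fit in int64 without overflow.
            n = A.shape[0]
            if n <= cutoff:
                return (A @ B) % p               # base case: one reduction per product
            h = n // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            C = np.empty((n, n), dtype=A.dtype)
            # Sums of recursive results stay unreduced; a single % p per block
            # keeps entries bounded while skipping most modular reductions.
            C[:h, :h] = (matmul_mod_p(A11, B11, p) + matmul_mod_p(A12, B21, p)) % p
            C[:h, h:] = (matmul_mod_p(A11, B12, p) + matmul_mod_p(A12, B22, p)) % p
            C[h:, :h] = (matmul_mod_p(A21, B11, p) + matmul_mod_p(A22, B21, p)) % p
            C[h:, h:] = (matmul_mod_p(A21, B12, p) + matmul_mod_p(A22, B22, p)) % p
            return C

    Each recursive product is an independent coarse grain task, which is exactly the granularity the paper exploits for parallelization.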

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. We perform a comparative study of the performance of the PaStiX solver on top of its native internal scheduler and the PaRSEC and StarPU frameworks in different execution environments. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver in heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient use from the programmer. Comment: Heterogeneity in Computing Workshop (2014)
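
    A runtime-agnostic sketch makes the division of labor concrete (generic Python, not the StarPU or PaRSEC APIs, which are C interfaces with far richer dependency and data management): the solver only declares tasks and their dependencies, and the runtime chooses the traversal and placement.

        from concurrent.futures import ThreadPoolExecutor

        def run_dag(tasks, deps, n_workers=4):
            # tasks: {name: callable}, listed in topological order;
            # deps:  {name: [names that must complete first]}.
            futures = {}

            def wrap(name):
                for d in deps.get(name, []):
                    futures[d].result()      # wait for predecessors to finish
                tasks[name]()

            with ThreadPoolExecutor(n_workers) as pool:
                for name in tasks:
                    futures[name] = pool.submit(wrap, name)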