Search CORE

5,627 research outputs found

LU factorization with panel rank revealing pivoting and its communication avoiding version

Author: Demmel James W.
Grigori Laura
Gu Ming
Khabou Amal
Publication venue
Publication date: 01/01/2012
Field of study

We present the LU decomposition with panel rank revealing pivoting (LU_PRRP), an LU factorization algorithm based on strong rank revealing QR panel factorization. LU_PRRP is more stable than Gaussian elimination with partial pivoting (GEPP). Our extensive numerical experiments show that the new factorization scheme is as numerically stable as GEPP in practice, but it is more resistant to pathological cases and easily solves the Wilkinson matrix and the Foster matrix. We also present CALU_PRRP, a communication avoiding version of LU_PRRP that minimizes communication. CALU_PRRP is based on tournament pivoting, with the selection of the pivots at each step of the tournament being performed via strong rank revealing QR factorization. CALU_PRRP is more stable than CALU, the communication avoiding version of GEPP. CALU_PRRP is also more stable in practice and is resistant to pathological cases on which GEPP and CALU fail.Comment: No. RR-7867 (2012

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Author: Buttari Alfredo
Dongarra Jack
Kurzak Jakub
Langou Julien
Publication venue
Publication date: 01/01/2007
Field of study

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations

arXiv.org e-Print Archive

CiteSeerX

A new approximate matrix factorization for implicit time integration in air pollution modeling

Author: Botchev M.A.
Verwer J.G.
Publication venue: Elsevier
Publication date: 01/01/2003
Field of study

Implicit time stepping typically requires solution of one or several linear systems with a matrix I−τJ per time step where J is the Jacobian matrix. If solution of these systems is expensive, replacing I−τJ with its approximate matrix factorization (AMF) (I−τR)(I−τV), R+V=J, often leads to a good compromise between stability and accuracy of the time integration on the one hand and its efficiency on the other hand. For example, in air pollution modeling, AMF has been successfully used in the framework of Rosenbrock schemes. The standard AMF gives an approximation to I−τJ with the error τ2RV, which can be significant in norm. In this paper we propose a new AMF. In assumption that −V is an M-matrix, the error of the new AMF can be shown to have an upper bound τ||R||, while still being asymptotically

O(\tau^2)

. This new AMF, called AMF+, is equal in costs to standard AMF and, as both analysis and numerical experiments reveal, provides a better accuracy. We also report on our experience with another, cheaper AMF and with AMF-preconditioned GMRES

CiteSeerX

University of Twente Research Information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization

Author: Donfack Simplice
Grigori Laura
Gropp William D.
Kale Vivek
Publication venue
Publication date: 08/10/2011
Field of study

We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48 core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well known libraries. On the 48 core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems

arXiv.org e-Print Archive

HAL-CentraleSupelec

HAL-Rennes 1

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Author: Ghysels Pieter
Li Xiaoye S.
Napov Artem
Rouet Francois-Henry
Williams Samuel
Publication venue
Publication date: 25/02/2015
Field of study

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices

arXiv.org e-Print Archive

eScholarship - University of California

DI-fusion

High performance interior point methods for three-dimensional finite element limit analysis

Author: Lyamin Andrei V.
Podlich Nathan
Sloan Scott W.
Publication venue: CIMNE
Publication date: 01/01/2019
Field of study

The ability to obtain rigorous upper and lower bounds on collapse loads of various structures makes ﬁnite element limit analysis an attractive design tool. The increasingly high cost of computing those bounds, however, has limited its application on problems in three dimensions. This work reports on a high-performance homogeneous self-dual primal-dual interior point method developed for three-dimensional ﬁnite element limit analysis. This implementation achieves convergence times over 4.5× faster than the leading commercial solver across a set of three-dimensional ﬁnite element limit analysis test problems, making investigation of three dimensional limit loads viable. A comparison between a range of iterative linear solvers and direct methods used to determine the search direction is also provided, demonstrating the superiority of direct methods for this application. The components of the interior point solver considered include the elimination of and options for handling remaining free variables, multifrontal and supernodal Cholesky comparison for computing the search direction, diﬀerences between approximate minimum degree [1] and nested dissection [13] orderings, dealing with dense columns and ﬁxed variables, and accelerating the linear system solver through parallelization. Each of these areas resulted in an improvement on at least one of the problems in the test set, with many achieving gains across the whole set. The serial implementation achieved runtime performance 1.7× faster than the commercial solver Mosek [5]. Compared with the parallel version of Mosek, the use of parallel BLAS routines in the supernodal solver saw a 1.9× speedup, and with a modiﬁed version of the GPU-enabled CHOLMOD [11] and a single NVIDIA Tesla K20c this speedup increased to 4.65×