
968 research outputs found

The Problem with the Linpack Benchmark Matrix Generator

Author: Jack J. Dongarra
Julien Langou
Publication date: 01/01/2008

We characterize the matrix sizes for which the Linpack Benchmark matrix generator constructs a matrix with identical columns.

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

Computing the Rank Profile Matrix

Author: Bourbaki N.
Dongarra J. J.
Grigor'ev D. Y.
Malaschonok G. I.
Storjohann A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/07/2015

The row (resp. column) rank profile of a matrix describes the staircase shape of its row (resp. column) echelon form. In an ISSAC'13 paper, we proposed a recursive Gaussian elimination that can compute simultaneously the row and column rank profiles of a matrix as well as those of all of its leading sub-matrices, in the same time as state-of-the-art Gaussian elimination algorithms. Here we first study the conditions making a Gaussian elimination algorithm reveal this information. To this end, we propose the definition of a new matrix invariant, the rank profile matrix, summarizing all information on the row and column rank profiles of all the leading sub-matrices. We also explore the conditions for a Gaussian elimination algorithm to compute all or part of this invariant, through the corresponding PLUQ decomposition. As a consequence, we show that the classical iterative CUP decomposition algorithm can actually be adapted to compute the rank profile matrix. Used, in a Crout variant, as a base case for our ISSAC'13 implementation, it delivers a significant improvement in efficiency. Second, the row (resp. column) echelon forms of a matrix are usually computed via different dedicated triangular decompositions. We show here that, from some PLUQ decompositions, it is possible to recover the row and column echelon forms of a matrix and of any of its leading sub-matrices thanks to an elementary post-processing algorithm.
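
To make the notion of a rank profile concrete, here is a minimal sketch that computes the row and column rank profiles of a small matrix by naive exact elimination. It assumes the usual definition of the row rank profile as the lexicographically smallest list of indices of linearly independent rows. It is not the recursive algorithm or the PLUQ-based method of the paper, and the function names are hypothetical.

```python
from fractions import Fraction


def row_rank_profile(rows):
    """Return indices of rows that are independent of all earlier rows.

    Naive greedy elimination over the rationals (illustrative only).
    """
    basis = []    # list of (pivot_column, reduced_row)
    profile = []
    for idx, row in enumerate(rows):
        v = [Fraction(x) for x in row]
        # Eliminate the components along previously found basis rows.
        for pivot, b in basis:
            if v[pivot] != 0:
                f = v[pivot] / b[pivot]
                v = [vi - f * bi for vi, bi in zip(v, b)]
        # A nonzero remainder means the row is independent of earlier rows.
        pivot = next((j for j, x in enumerate(v) if x != 0), None)
        if pivot is not None:
            basis.append((pivot, v))
            profile.append(idx)
    return profile


def column_rank_profile(rows):
    """Column rank profile, computed as the row rank profile of the transpose."""
    return row_rank_profile([list(col) for col in zip(*rows)])


A = [[0, 0],
     [1, 2],
     [2, 4],
     [0, 1]]
print(row_rank_profile(A))     # [1, 3]
print(column_rank_profile(A))  # [0, 1]
```

Row 0 is zero and row 2 is a multiple of row 1, so only rows 1 and 3 enter the row rank profile.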

arXiv.org e-Print Archive

HAL-ENS-LYON

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

On the Performance Prediction of BLAS-based Tensor Contractions

Author: CL Lawson
E Napoli Di
G Baumgartner
J Čížek
JJ Dongarra
L Lehner
LE Kidder
Q Lu
R Iakymchuk
RJ Bartlett
T Helgaker
Publication date: 30/09/2014

Tensor operations are surging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly optimized libraries (BLAS), for higher-dimensional tensors neither standards nor highly tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
Comment: Submitted to PMBS1
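
The idea of breaking a contraction down into matrix operations can be sketched as follows. This is an illustration only: the contraction C[a][b][c] = sum over k of A[a][k] * B[k][b][c] is an assumed example, the `gemm` helper is a plain-Python stand-in for a BLAS GEMM call, and none of the names come from the paper.

```python
def gemm(A, B):
    """Plain matrix product; stands in for a BLAS GEMM kernel."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def contract_via_gemm(A, B):
    """Compute C[a][b][c] = sum_k A[a][k] * B[k][b][c] with one GEMM per slice c."""
    na, nk, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        # Fixing c turns B into an (nk x nb) matrix.
        B_slice = [[B[k][b][c] for b in range(nb)] for k in range(nk)]
        C_slice = gemm(A, B_slice)
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = C_slice[a][b]
    return C


def contract_direct(A, B):
    """Reference implementation with explicit nested loops."""
    na, nk, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    return [[[sum(A[a][k] * B[k][b][c] for k in range(nk)) for c in range(nc)]
             for b in range(nb)] for a in range(na)]


A = [[1, 2], [3, 4]]
B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
assert contract_via_gemm(A, B) == contract_direct(A, B)
```

Slicing along b instead of c, or flattening (b, c) into a single index and issuing one larger GEMM, gives other decompositions of the same contraction; choosing among such alternatives is the kind of decision the abstract describes.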

arXiv.org e-Print Archive

Crossref

Publikationsserver der RWTH Aachen University

Message‐passing performance of various computers

Author: Jack J. Dongarra
Tom Dunigan
Publication venue: 'Wiley'
Publication date: 01/01/2002

Crossref

Message-passing performance of various computers

Author: Jack J. Dongarra
Tom Dunigan
Publication venue: 'Wiley'
Publication date: 01/01/2005

Crossref

Implementing Dense Linear Algebra Algorithms Using Multitasking on the CRAY X-MP-4 (or Approaching the Gigaflop)

Author: Jack J. Dongarra
Tom Hewitt
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'

Crossref

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

Author: Dongarra Jack J.
Gustavson Fred G.
Wasniewski Jerzy
Publication venue: Technical University of Denmark, DTU Informatics, Building 321
Publication date: 01/01/2008

Online Research Database In Technology

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

Author: Andersen B. S.
Fred G. Gustavson
Gustavson F. G.
Jack J. Dongarra
Jerzy Waśniewski
Julien Langou
Publication date: 01/01/2009

We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via Level 3 BLAS, as RFPF is a standard full format representation. Also, RFPF requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Using vendor LAPACK full routines with RFPF is roughly 1 to 43 times faster for serial and 1 to 97 times faster for SMP parallel processing than using vendor and/or reference packed routines.
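
The storage comparison in the abstract can be checked with a short sketch. It shows the standard column-major lower packed layout, which needs n(n+1)/2 entries versus n^2 for full format; it does not reproduce the RFPF layout itself, which the abstract does not spell out. The function names are hypothetical.

```python
def full_size(n):
    """Entries used by a full two-dimensional n x n array."""
    return n * n


def packed_size(n):
    """Entries used by packed storage of one triangle: n(n+1)/2."""
    return n * (n + 1) // 2


def packed_index(i, j, n):
    """0-based position of A[i][j] (with i >= j) in column-major lower packed storage."""
    if i < j:
        raise ValueError("only the lower triangle (i >= j) is stored")
    return i + j * (2 * n - j - 1) // 2


def pack_lower(A):
    """Copy the lower triangle of a square matrix into a packed list."""
    n = len(A)
    ap = [None] * packed_size(n)
    for j in range(n):
        for i in range(j, n):
            ap[packed_index(i, j, n)] = A[i][j]
    return ap


S = [[1, 2, 3],
     [2, 4, 5],
     [3, 5, 6]]
print(pack_lower(S))                        # [1, 2, 3, 4, 5, 6]
print(full_size(1000), packed_size(1000))   # 1000000 500500
```

For n = 1000 the full array holds 1,000,000 entries against 500,500 for the packed triangle, which is the "nearly half" the abstract refers to.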

arXiv.org e-Print Archive

CiteSeerX

Crossref

The University of Manchester - Institutional Repository

Online Research Database In Technology

GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement

Author: Anzt H.
Dongarra J.
Heuveline Vincent
Luszczek P.
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2011

In hardware-aware high performance computing, block-asynchronous iteration and mixed precision iterative refinement are two techniques that may be used to leverage the computing power of SIMD accelerators like GPUs in the iterative solution of linear equation systems. Although they use very different approaches for this purpose, they share the basic idea of compensating for the convergence properties of an inferior numerical algorithm by a more efficient usage of the provided computing power. In this paper, we analyze the potential of combining both techniques. To this end, we derive a mixed precision iterative refinement algorithm using a block-asynchronous iteration as an error correction solver, and compare its performance with a pure implementation of a block-asynchronous iteration and an iterative refinement method using double precision for the error correction solver. For matrices from the University of Florida Matrix collection, we report the convergence behaviour and provide the total solver runtime using different GPU architectures.
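
The structure of mixed precision iterative refinement can be sketched on the CPU as follows. This sketch makes several assumptions: plain synchronous Jacobi sweeps replace the block-asynchronous iteration of the paper, single precision is emulated by rounding through `struct`, and all function names and parameters are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def residual(A, x, b):
    """r = b - A x, accumulated in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(A, b)]


def jacobi_f32(A, r, sweeps):
    """Approximately solve A c = r with Jacobi sweeps in emulated float32."""
    n = len(A)
    c = [0.0] * n
    for _ in range(sweeps):
        new_c = []
        for i in range(n):
            s = to_f32(r[i])
            for j in range(n):
                if j != i:
                    s = to_f32(s - to_f32(to_f32(A[i][j]) * c[j]))
            new_c.append(to_f32(s / to_f32(A[i][i])))
        c = new_c
    return c


def mixed_precision_refine(A, b, tol=1e-12, max_outer=50, sweeps=5):
    """Outer loop in double precision, error correction in low precision."""
    x = [0.0] * len(b)
    for outer in range(max_outer):
        r = residual(A, x, b)               # high-precision residual
        if max(abs(v) for v in r) <= tol:
            return x, outer
        c = jacobi_f32(A, r, sweeps)        # low-precision correction solve
        x = [xi + ci for xi, ci in zip(x, c)]
    return x, max_outer


A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, outer_iterations = mixed_precision_refine(A, b)
print(x, outer_iterations)
```

The inner solver only needs to reduce the error by some factor per call; the double-precision residual in the outer loop is what recovers full accuracy.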

KITopen