Search CORE

415 research outputs found

On the Performance Prediction of BLAS-based Tensor Contractions

Author: CL Lawson
E Napoli Di
G Baumgartner
J C̆íz̆ek
JJ Dongarra
JJ Dongarra
L Lehner
LE Kidder
Q Lu
R Iakymchuk
RJ Bartlett
T Helgaker
Publication venue
Publication date: 30/09/2014
Field of study

Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.Comment: Submitted to PMBS1

arXiv.org e-Print Archive

Crossref

Publikationsserver der RWTH Aachen University

Developing numerical libraries in Java

Author: Boisvert Ronald F.
Dongarra Jack J.
Pozo Roldan
Remington Karin
Stewart G. W.
Publication venue
Publication date: 01/01/1998
Field of study

The rapid and widespread adoption of Java has created a demand for reliable and reusable mathematical software components to support the growing number of compute-intensive applications now under development, particularly in science and engineering. In this paper we address practical issues of the Java language and environment which have an effect on numerical library design and development. Benchmarks which illustrate the current levels of performance of key numerical kernels on a variety of Java platforms are presented. Finally, a strategy for the development of a fundamental numerical toolkit for Java is proposed and its current status is described.Comment: 11 pages. Revised version of paper presented to the 1998 ACM Conference on Java for High Performance Network Computing. To appear in Concurrency: Practice and Experienc

arXiv.org e-Print Archive

CiteSeerX

The University of Manchester - Institutional Repository

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

Author: Andersen B. S.
Fred G. Gustavson
Gustavson F. G.
Gustavson F. G.
Jack J. Dongarra
Jerzy Waśniewski
Julien Langou
Publication venue
Publication date: 01/01/2009
Field of study

We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via using Level 3 BLAS as RFPF is a standard full format representation. Also, RFPF requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means {\it no} new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Performance gains are roughly one to a factor of 43 for serial and one to a factor of 97 for SMP parallel times faster using vendor LAPACK full routines with RFPF than with using vendor and/or reference packed routines

arXiv.org e-Print Archive

CiteSeerX

Crossref

The University of Manchester - Institutional Repository

Online Research Database In Technology

Computing the Rank Profile Matrix

Author: Bourbaki N.
Dongarra J. J.
Grigor'ev D. Y.
Malaschonok G. I.
Storjohann A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/07/2015
Field of study

The row (resp. column) rank profile of a matrix describes the staircase shape of its row (resp. column) echelon form. In an ISSAC'13 paper, we proposed a recursive Gaussian elimination that can compute simultaneously the row and column rank profiles of a matrix as well as those of all of its leading sub-matrices, in the same time as state of the art Gaussian elimination algorithms. Here we first study the conditions making a Gaus-sian elimination algorithm reveal this information. Therefore, we propose the definition of a new matrix invariant, the rank profile matrix, summarizing all information on the row and column rank profiles of all the leading sub-matrices. We also explore the conditions for a Gaussian elimination algorithm to compute all or part of this invariant, through the corresponding PLUQ decomposition. As a consequence, we show that the classical iterative CUP decomposition algorithm can actually be adapted to compute the rank profile matrix. Used, in a Crout variant, as a base-case to our ISSAC'13 implementation, it delivers a significant improvement in efficiency. Second, the row (resp. column) echelon form of a matrix are usually computed via different dedicated triangular decompositions. We show here that, from some PLUQ decompositions, it is possible to recover the row and column echelon forms of a matrix and of any of its leading sub-matrices thanks to an elementary post-processing algorithm

arXiv.org e-Print Archive

HAL-ENS-LYON

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

A survey of parallel algorithms for fractal image compression

Author: Beaumont J. M.
Bodo Z. P.
Chang H. T.
Davis G. M.
Dongarra J.
Huffman D. A.
Mallat S. G.
Publication venue: 'Multi-Science Publishing Co. Ltd.'
Publication date: 01/06/2007
Field of study

This paper presents a short survey of the key research work that has been undertaken in the application of parallel algorithms for Fractal image compression. The interest in fractal image compression techniques stems from their ability to achieve high compression ratios whilst maintaining a very high quality in the reconstructed image. The main drawback of this compression method is the very high computational cost that is associated with the encoding phase. Consequently, there has been significant interest in exploiting parallel computing architectures in order to speed up this phase, whilst still maintaining the advantageous features of the approach. This paper presents a brief introduction to fractal image compression, including the iterated function system theory upon which it is based, and then reviews the different techniques that have been, and can be, applied in order to parallelize the compression algorithm

CiteSeerX

Crossref

White Rose Research Online

Rectangular Full Packed Format for Cholesky's Algorithm:Factorization, Solution and Inversion

Author: Dongarra Jack J.
Gustavson Fred G.
Wasniewski Jerzy
Publication venue: Technical University of Denmark, DTU Informatics, Building 321
Publication date: 01/01/2008
Field of study

Online Research Database In Technology

Level-3 Cholesky Kernel Subroutine of a Fully Portable High Performance Minimal Storage Hybrid Format Cholesky Algorithm

Author: Dongarra Jack J.
Gustavson Fred G.
Wasniewski Jerzy
Publication venue: Technical University of Denmark, DTU Informatics, Building 321
Publication date: 01/01/2008
Field of study

CiteSeerX

Online Research Database In Technology

Variable-Size Batched Condition Number Calculation on GPUs

Author: Anzt Hartwig
Dongarra J.
Flegar G.
Grützmacher Thomas
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 01/01/2019
Field of study

Crossref

KITopen

A review of High Performance Computing foundations for scientists

Author: Cramer C. J.
Dongarra J.
Dror R. O.
Goldberg D.
Hager G.
Haoqiang J.
Hennessy J. L.
Marx D.
Moore G.
PABLO E. IBÁÑEZ
PABLO GARCÍA-RISUEÑO
Wilkinson B.
Woo D. H.
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 23/05/2012
Field of study

The increase of existing computational capabilities has made simulation emerge as a third discipline of Science, lying midway between experimental and purely theoretical branches [1, 2]. Simulation enables the evaluation of quantities which otherwise would not be accessible, helps to improve experiments and provides new insights on systems which are analysed [3-6]. Knowing the fundamentals of computation can be very useful for scientists, for it can help them to improve the performance of their theoretical models and simulations. This review includes some technical essentials that can be useful to this end, and it is devised as a complement for researchers whose education is focused on scientific issues and not on technological respects. In this document we attempt to discuss the fundamentals of High Performance Computing (HPC) [7] in a way which is easy to understand without much previous background. We sketch the way standard computers and supercomputers work, as well as discuss distributed computing and discuss essential aspects to take into account when running scientific calculations in computers.Comment: 33 page

arXiv.org e-Print Archive

Crossref

Empowering Science through Computing, Preface for ICCS 2012

Author: Ali Hesham
Shi Yong
Khazanchi Deepak
Lees Michael
Albada G. Dick van
Dongarra Jack
Sloot Peter M.A.
Publication venue: Published by Elsevier B.V.
Publication date: 14/05/1928
Field of study

Digitalitzat per Artypla

Elsevier - Publisher Connector

Crossref

Repositori Obert de Coneixement de l'Ajuntament de Barcelona

UvA-DARE