The Problem with the Linpack Benchmark Matrix Generator

Author: Jack J. Dongarra
Julien Langou
Publication date: 01/01/2008

We characterize the matrix sizes for which the Linpack Benchmark matrix generator constructs a matrix with identical columns.

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

Efficient computation of Hamiltonian matrix elements between non-orthogonal Slater determinants

Author: Noritaka Shimizu
Takaharu Otsuka
Takashi Abe
Yutaka Utsuno
Publication venue: 'Elsevier BV'
Publication date: 18/09/2012

We present an efficient numerical method for computing Hamiltonian matrix elements between non-orthogonal Slater determinants, focusing on the most time-consuming component of the calculation that involves a sparse array. In the usual case where many matrix elements should be calculated, this computation can be transformed into a multiplication of dense matrices. It is demonstrated that the present method based on the matrix-matrix multiplication attains ~80% of the theoretical peak performance measured on systems equipped with modern microprocessors, a factor of 5-10 better than the normal method using indirectly indexed arrays to treat a sparse array. The reason for such different performances is discussed from the viewpoint of memory access. Comment: 8 pages, 3 figures

arXiv.org e-Print Archive

Lirias

Crossref

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Author: Camille Coti
Emmanuel Agullo
Jack Dongarra
Julien Langou
Thomas Herault
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/12/2009

Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA).

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Rennes 1

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even though one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
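The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a small serial sketch. This is not the authors' parallel subroutine: it only shows the classic single-error scheme attributed to Huang and Abraham, in which the operands carry checksum rows and columns so that one corrupted entry of the product can be located and repaired. All function names below are hypothetical, and the pure-Python `matmul` stands in for an optimized kernel.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def abft_matmul(a, b):
    """Multiply checksum-augmented operands.

    The result holds C = A*B in its top-left block, row sums of C in
    the last column and column sums of C in the last row.
    """
    return matmul(with_column_checksum(a), with_row_checksum(b))


def locate_and_correct(c_full, tol=1e-9):
    """Repair at most one corrupted entry of the data block in place.

    Returns the (row, column) of the repaired entry, or None if every
    checksum is consistent.
    """
    n_rows = len(c_full) - 1
    n_cols = len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if abs(sum(c_full[i][:n_cols]) - c_full[i][n_cols]) > tol]
    bad_cols = [j for j in range(n_cols)
                if abs(sum(c_full[i][j] for i in range(n_rows))
                       - c_full[n_rows][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row-checksum mismatch equals the error in entry (i, j).
        delta = sum(c_full[i][:n_cols]) - c_full[i][n_cols]
        c_full[i][j] -= delta
        return (i, j)
    raise ValueError("more than one corrupted entry; cannot correct")


c_full = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
c_full[1][0] += 5                    # simulate a single corrupted entry
repaired = locate_and_correct(c_full)
```

The sketch covers only a single wrong value in the data block; recovering from a lost process, as reported in the abstract, requires treating the missing block as an erasure and is not shown here.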

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

Computing the Conditioning of the Components of a Linear Least Squares Solution

Author: Baboulin Marc
Dongarra Jack
Gratton Serge
Langou Julien
Publication date: 03/10/2007

In this paper, we address the accuracy of the results for the overdetermined full rank linear least squares problem. We recall theoretical results obtained in Arioli, Baboulin and Gratton, SIMAX 29(2):413--433, 2007, on conditioning of the least squares solution and the components of the solution when the matrix perturbations are measured in Frobenius or spectral norms. Then we define computable estimates for these condition numbers and we interpret them in terms of statistical quantities. In particular, we show that, in the classical linear statistical model, the ratio of the variance of one component of the solution to the variance of the right-hand side is exactly the condition number of this solution component when perturbations on the right-hand side are considered. We also provide code fragments using LAPACK routines to compute the variance-covariance matrix and the least squares conditioning and we give the corresponding computational cost. Finally we present a small historical numerical example that was used by Laplace in Theorie Analytique des Probabilites, 1820, for computing the mass of Jupiter and experiments from the space industry with real physical data.

arXiv.org e-Print Archive

MIMS EPrints

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Author: Buttari Alfredo
Dongarra Jack
Kurzak Jakub
Langou Julien
Publication date: 01/01/2007

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorizations where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations.

arXiv.org e-Print Archive

CiteSeerX

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

MIMS EPrints

The University of Manchester - Institutional Repository

Lecture 11: The Road to Exascale and Legacy Software for Dense Linear Algebra

Author: Dongarra Jack
Publication venue: ScholarWorks@UARK
Publication date: 06/04/2021

In this talk, we will look at the current state of high performance computing and look at the next stage of extreme computing. With extreme computing, there will be fundamental changes in the character of floating point arithmetic and data movement. In this talk, we will look at how extreme-scale computing has caused algorithm and software developers to change their way of thinking about implementing and programming specific applications.

ScholarWorks@UARK

On the Performance Prediction of BLAS-based Tensor Contractions

Author: CL Lawson
E Di Napoli
G Baumgartner
J C̆íz̆ek
JJ Dongarra
L Lehner
LE Kidder
Q Lu
R Iakymchuk
RJ Bartlett
T Helgaker
Publication date: 30/09/2014

Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms. Comment: Submitted to PMBS1
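As a rough illustration of how one contraction admits several decompositions into matrix kernels, the sketch below evaluates C[i][j][l] = sum over k of A[i][k] * B[k][j][l] in two ways: as a single matrix product on a flattened B, and as one smaller matrix product per slice of B. The contraction, the function names and the pure-Python `gemm` (a stand-in for a BLAS `dgemm` call) are assumptions chosen for the example and are not taken from the paper.

```python
def gemm(a, b):
    """Plain matrix product; a stand-in for a BLAS dgemm call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def contract_single_gemm(a, b):
    """C[i][j][l] = sum_k A[i][k] * B[k][j][l] via one large GEMM.

    B is flattened to a K x (J*L) matrix, multiplied once, and the
    result is reshaped back to I x J x L.
    """
    n_j, n_l = len(b[0]), len(b[0][0])
    b_flat = [[b_k[j][l] for j in range(n_j) for l in range(n_l)]
              for b_k in b]
    c_flat = gemm(a, b_flat)
    return [[row[j * n_l:(j + 1) * n_l] for j in range(n_j)]
            for row in c_flat]


def contract_gemm_per_slice(a, b):
    """Same contraction via one smaller GEMM for each fixed index l."""
    n_j, n_l = len(b[0]), len(b[0][0])
    c = [[[0] * n_l for _ in range(n_j)] for _ in range(len(a))]
    for l in range(n_l):
        b_slice = [[b_k[j][l] for j in range(n_j)] for b_k in b]  # K x J
        c_slice = gemm(a, b_slice)                                # I x J
        for i, row in enumerate(c_slice):
            for j, value in enumerate(row):
                c[i][j][l] = value
    return c


a = [[1, 2], [3, 4]]
b = [[[1, 0], [0, 1]], [[2, 1], [1, 2]]]
c_one = contract_single_gemm(a, b)
c_many = contract_gemm_per_slice(a, b)
```

Both routines return the same tensor; they differ only in the number and shape of the matrix products, which is the kind of difference a performance model for the underlying kernels would be asked to rank.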

arXiv.org e-Print Archive

Crossref

Publikationsserver der RWTH Aachen University

Characterization of the errors of the FMM in particle simulations

Author: Appel Andrew
Barba
Barnes
Chatelain
Cottet
Dongarra
Fenley
Greengard
Gáspar
Hamilton
Publication venue: 'Wiley'
Publication date: 10/09/2008

The Fast Multipole Method (FMM) offers an acceleration for pairwise interaction calculation, known as $N$-body problems, from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ with $N$ particles. This has brought a dramatic increase in the capability of particle simulations in many application areas, such as electrostatics, particle formulations of fluid mechanics, and others. Although the literature on the subject provides theoretical error bounds for the FMM approximation, there are not many reports of the measured errors in a suite of computational experiments. We have performed such an experimental investigation, and summarized the results of about 1000 calculations using the FMM algorithm, to characterize the accuracy of the method in relation to the different parameters available to the user. In addition to the more standard diagnostic of the maximum error, we supply illustrations of the spatial distribution of the errors, which offers visual evidence of all the contributing factors to the overall approximation accuracy: multipole expansion, local expansion, hierarchical spatial decomposition (interaction lists, local domain, far domain). This presentation is a contribution to any researcher wishing to incorporate the FMM acceleration to their application code, as it aids in understanding where accuracy is gained or compromised. Comment: 34 pages, 38 images

arXiv.org e-Print Archive

Crossref

Explore Bristol Research