2,920 research outputs found
The Problem with the Linpack Benchmark Matrix Generator
We characterize the matrix sizes for which the Linpack Benchmark matrix
generator constructs a matrix with identical columns
Efficient computation of Hamiltonian matrix elements between non-orthogonal Slater determinants
We present an efficient numerical method for computing Hamiltonian matrix
elements between non-orthogonal Slater determinants, focusing on the most
time-consuming component of the calculation that involves a sparse array. In
the usual case where many matrix elements should be calculated, this
computation can be transformed into a multiplication of dense matrices. It is
demonstrated that the present method based on the matrix-matrix multiplication
attains 80% of the theoretical peak performance measured on systems
equipped with modern microprocessors, a factor of 5-10 better than the normal
method using indirectly indexed arrays to treat a sparse array. The reason for
such different performances is discussed from the viewpoint of memory access.Comment: 8 pages, 3 figure
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
Computing the Conditioning of the Components of a Linear Least Squares Solution
In this paper, we address the accuracy of the results for the overdetermined
full rank linear least squares problem. We recall theoretical results obtained
in Arioli, Baboulin and Gratton, SIMAX 29(2):413--433, 2007, on conditioning of
the least squares solution and the components of the solution when the matrix
perturbations are measured in Frobenius or spectral norms. Then we define
computable estimates for these condition numbers and we interpret them in terms
of statistical quantities. In particular, we show that, in the classical linear
statistical model, the ratio of the variance of one component of the solution
by the variance of the right-hand side is exactly the condition number of this
solution component when perturbations on the right-hand side are considered. We
also provide fragment codes using LAPACK routines to compute the
variance-covariance matrix and the least squares conditioning and we give the
corresponding computational cost. Finally we present a small historical
numerical example that was used by Laplace in Theorie Analytique des
Probabilites, 1820, for computing the mass of Jupiter and experiments from the
space industry with real physical data
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU and QR factorization where the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an out of
order execution of the tasks which will completely hide the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms where parallelism can only be
exploited at the level of the BLAS operations and vendor implementations
Lecture 11: The Road to Exascale and Legacy Software for Dense Linear Algebra
In this talk, we will look at the current state of high performance computing and look at the next stage of extreme computing. With extreme computing, there will be fundamental changes in the character of floating point arithmetic and data movement. In this talk, we will look at how extreme-scale computing has caused algorithm and software developers to change their way of thinking on implementing and program-specific applications
On the Performance Prediction of BLAS-based Tensor Contractions
Tensor operations are surging as the computational building blocks for a
variety of scientific simulations and the development of high-performance
kernels for such operations is known to be a challenging task. While for
operations on one- and two-dimensional tensors there exist standardized
interfaces and highly-optimized libraries (BLAS), for higher dimensional
tensors neither standards nor highly-tuned implementations exist yet. In this
paper, we consider contractions between two tensors of arbitrary dimensionality
and take on the challenge of generating high-performance implementations by
resorting to sequences of BLAS kernels. The approach consists in breaking the
contraction down into operations that only involve matrices or vectors. Since
in general there are many alternative ways of decomposing a contraction, we are
able to methodically derive a large family of algorithms. The main contribution
of this paper is a systematic methodology to accurately identify the fastest
algorithms in the bunch, without executing them. The goal is instead
accomplished with the help of a set of cache-aware micro-benchmarks for the
underlying BLAS kernels. The predictions we construct from such benchmarks
allow us to reliably single out the best-performing algorithms in a tiny
fraction of the time taken by the direct execution of the algorithms.Comment: Submitted to PMBS1
Characterization of the errors of the FMM in particle simulations
The Fast Multipole Method (FMM) offers an acceleration for pairwise
interaction calculation, known as -body problems, from to
with particles. This has brought dramatic increase in the
capability of particle simulations in many application areas, such as
electrostatics, particle formulations of fluid mechanics, and others. Although
the literature on the subject provides theoretical error bounds for the FMM
approximation, there are not many reports of the measured errors in a suite of
computational experiments. We have performed such an experimental
investigation, and summarized the results of about 1000 calculations using the
FMM algorithm, to characterize the accuracy of the method in relation with the
different parameters available to the user. In addition to the more standard
diagnostic of the maximum error, we supply illustrations of the spatial
distribution of the errors, which offers visual evidence of all the
contributing factors to the overall approximation accuracy: multipole
expansion, local expansion, hierarchical spatial decomposition (interaction
lists, local domain, far domain). This presentation is a contribution to any
researcher wishing to incorporate the FMM acceleration to their application
code, as it aids in understanding where accuracy is gained or compromised.Comment: 34 pages, 38 image
- âŠ