Cache-aware Performance Modeling and Prediction for Dense Linear Algebra
Countless applications cast their computational core in terms of dense linear
algebra operations. These operations can usually be implemented by combining
the routines offered by standard linear algebra libraries such as BLAS and
LAPACK, and typically each operation can be obtained in many alternative ways.
Interestingly, identifying the fastest implementation -- without executing it
-- is a challenging task even for experts. An equally challenging task is that
of tuning each routine to performance-optimal configurations. Indeed, the
problem is so difficult that even the default values provided by the libraries
are often considerably suboptimal; as a solution, normally one has to resort to
executing and timing the routines, driven by some form of parameter search. In
this paper, we discuss a methodology to solve both problems: identifying the
best performing algorithm within a family of alternatives, and tuning
algorithmic parameters for maximum performance; in both cases, we do not
execute the algorithms themselves. Instead, our methodology relies on timing
and modeling the computational kernels underlying the algorithms, and on a
technique for tracking the contents of the CPU cache. In general, our
performance predictions allow us to tune dense linear algebra algorithms to
within a few percent of the best attainable results, thus allowing computational
scientists and code developers alike to efficiently optimize their linear
algebra routines and codes.
Comment: Submitted to PMBS1
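The selection idea in this abstract can be illustrated with a minimal sketch: each algorithmic variant is described as a sequence of kernel calls, a per-kernel model estimates the time of each call, and the variant with the smallest predicted total is chosen without running it. Everything here is an assumption for illustration: the names `KERNEL_MODELS`, `predict_time`, and `best_variant` are hypothetical, the rate constants are made up rather than measured, and the sketch ignores the cache-tracking step that the abstract describes.

```python
# Hypothetical per-kernel models: estimated seconds as a function of operand
# sizes. The constants are invented; a real model would be fitted to timings.
KERNEL_MODELS = {
    "gemm": lambda m, n, k: 2.0 * m * n * k / 8.0e9,
    "trsm": lambda m, n: 1.0 * m * m * n / 5.0e9,
}

def predict_time(calls):
    """Sum the modeled time of every kernel call in one algorithmic variant."""
    return sum(KERNEL_MODELS[name](*sizes) for name, *sizes in calls)

def best_variant(variants):
    """Return the name of the variant with the smallest predicted time."""
    return min(variants, key=lambda name: predict_time(variants[name]))

# Two made-up variants expressed as (kernel, sizes...) tuples.
variants = {
    "variant_a": [("gemm", 1000, 1000, 1000)],
    "variant_b": [("trsm", 1000, 1000), ("gemm", 1000, 1000, 500)],
}
print(best_variant(variants))
```

The same prediction function could in principle be evaluated over a range of block sizes to tune a parameter, again without executing the algorithm itself.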
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, giving them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014
A bibliography on parallel and vector numerical algorithms
This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.
Solving polynomial eigenvalue problems by means of the Ehrlich-Aberth method
Given the matrix polynomial $P(x) = \sum_{i=0}^{k} P_i x^i$, we
consider the associated polynomial eigenvalue problem. This problem, viewed in
terms of computing the roots of the scalar polynomial $\det P(x)$, is treated
in polynomial form rather than in matrix form by means of the Ehrlich-Aberth
iteration. The main computational issues are discussed, namely, the choice of
the starting approximations needed to start the Ehrlich-Aberth iteration, the
computation of the Newton correction, the halting criterion, and the treatment
of eigenvalues at infinity. We arrive at an effective implementation which
provides more accurate approximations to the eigenvalues than the
methods based on the QZ algorithm. The case of polynomials having special
structures, like palindromic, Hamiltonian, symplectic, etc., where the
eigenvalues have special symmetries in the complex plane, is considered. A
general way to adapt the Ehrlich-Aberth iteration to structured matrix
polynomials is introduced. Numerical experiments that confirm the effectiveness
of this approach are reported.
Comment: Submitted to Linear Algebra App
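As a rough illustration of the iteration named in this abstract, the following sketch applies the Ehrlich-Aberth update to an ordinary scalar polynomial given by its coefficients. It is not the authors' implementation: the paper works with the determinant of a matrix polynomial and discusses more refined starting approximations, Newton-correction computation, and halting criteria. The function names, the starting circle based on a Cauchy-type root bound, and the simple step-size stopping test are all assumptions made for this example.

```python
import cmath

def eval_poly(coeffs, z):
    """Horner evaluation of p(z) and p'(z); coeffs run from highest degree down."""
    p, dp = 0j, 0j
    for c in coeffs:
        dp = dp * z + p
        p = p * z + c
    return p, dp

def ehrlich_aberth(coeffs, tol=1e-12, max_iter=200):
    """Approximate all roots of a scalar polynomial simultaneously."""
    n = len(coeffs) - 1
    # Assumed starting points: a circle enclosing all roots, rotated by a
    # small angle so the points are not symmetric about the real axis.
    radius = 1 + max(abs(c / coeffs[0]) for c in coeffs[1:])
    z = [radius * cmath.exp(1j * (2 * cmath.pi * k / n + 0.4)) for k in range(n)]
    for _ in range(max_iter):
        largest_step = 0.0
        for i in range(n):
            p, dp = eval_poly(coeffs, z[i])
            if p == 0:
                continue  # z[i] is already an exact root
            # Sum of 1/(z_i - z_j): pushes approximations away from each other.
            repulsion = sum(1 / (z[i] - z[j]) for j in range(n) if j != i)
            # Equivalent to N / (1 - N * repulsion) with Newton correction N = p/p'.
            step = 1 / (dp / p - repulsion)
            z[i] -= step
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return z

# Roots of (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6.
roots = sorted(ehrlich_aberth([1, -6, 11, -6]), key=lambda r: r.real)
print(roots)
```

Each approximation is updated using the most recent values of the others, which is one common way to organize the sweep; updating all approximations from the previous sweep is an equally valid alternative.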
Revisiting Matrix Product on Master-Worker Platforms
This paper is aimed at designing efficient parallel matrix-product algorithms
for heterogeneous master-worker platforms. While matrix product is
well understood for homogeneous 2D-arrays of processors (e.g., Cannon's algorithm
and the ScaLAPACK outer product algorithm), there are three key hypotheses that
render our work original and innovative:
- Centralized data. We assume that all matrix files originate from, and must
be returned to, the master.
- Heterogeneous star-shaped platforms. We target fully heterogeneous
platforms, where computational resources have different computing powers.
- Limited memory. Because we investigate the parallelization of large
problems, we cannot assume that full matrix panels can be stored in the worker
memories and re-used for subsequent updates (as in ScaLAPACK).
We have devised efficient algorithms for resource selection (deciding which
workers to enroll) and communication ordering (both for input and result
messages), and we report a set of numerical experiments on various platforms at
Ecole Normale Superieure de Lyon and the University of Tennessee. However, we
point out that in this first version of the report, experiments are limited to
homogeneous platforms.
A fast solver for linear systems with displacement structure
We describe a fast solver for linear systems with reconstructable Cauchy-like
structure, which requires O(rn^2) floating point operations and O(rn) memory
locations, where n is the size of the matrix and r its displacement rank. The
solver is based on the application of the generalized Schur algorithm to a
suitable augmented matrix, under some assumptions on the knots of the
Cauchy-like matrix. It includes various pivoting strategies, already discussed
in the literature, and a new algorithm, which only requires reconstructability.
We have developed a software package, written in Matlab and C-MEX, which
provides a robust implementation of the above method. Our package also includes
solvers for Toeplitz(+Hankel)-like and Vandermonde-like linear systems, as
these structures can be reduced to Cauchy-like by fast and stable transforms.
Numerical experiments demonstrate the effectiveness of the software.
Comment: 27 pages, 6 figure
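To make the O(rn) storage figure concrete, here is a small sketch of a Cauchy-like matrix under one common convention, which may differ from the paper's notation: with knots t and s, the matrix C satisfies D_t C - C D_s = G B^T, where D_t and D_s are diagonal and the generators G and B each hold n rows of length r. When t[i] differs from s[j] for all i and j, every entry can be rebuilt from the generators alone. The function names and the sample data are assumptions, and the sketch shows only the structure, not the generalized Schur solver.

```python
def cauchy_like(t, s, G, B):
    """Rebuild C[i][j] = (G[i] . B[j]) / (t[i] - s[j]); needs t[i] != s[j]."""
    n = len(t)
    return [[sum(g * b for g, b in zip(G[i], B[j])) / (t[i] - s[j])
             for j in range(n)] for i in range(n)]

def displacement(t, s, C):
    """Entries of D_t C - C D_s, which equal (t[i] - s[j]) * C[i][j]."""
    n = len(t)
    return [[(t[i] - s[j]) * C[i][j] for j in range(n)] for i in range(n)]

# Classical Cauchy matrix: displacement rank r = 1 with all-ones generators.
t = [1.0, 2.0, 3.0]
s = [0.5, 1.5, 2.5]
G = [[1.0], [1.0], [1.0]]   # n rows, each of length r
B = [[1.0], [1.0], [1.0]]   # n rows, each of length r
C = cauchy_like(t, s, G, B)
print(displacement(t, s, C))  # every entry is close to 1, a rank-one matrix
```

Only t, s, G, and B need to be stored, which is where the O(rn) memory count comes from; the dense n-by-n matrix is formed here purely for checking.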