
    Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

    Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation without executing it is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, one normally has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms to within a few percent of the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes. Comment: Submitted to PMBS1

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. We perform a comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU, on different execution environments. The analysis highlights that these generic task-based runtimes achieve results comparable to those of the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. Comment: Heterogeneity in Computing Workshop (2014)

    A bibliography on parallel and vector numerical algorithms

    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are also listed.

    Solving polynomial eigenvalue problems by means of the Ehrlich-Aberth method

    Given the $n\times n$ matrix polynomial $P(x)=\sum_{i=0}^k P_i x^i$, we consider the associated polynomial eigenvalue problem. This problem, viewed in terms of computing the roots of the scalar polynomial $\det P(x)$, is treated in polynomial form rather than in matrix form by means of the Ehrlich-Aberth iteration. The main computational issues are discussed, namely, the choice of the starting approximations for the Ehrlich-Aberth iteration, the computation of the Newton correction, the halting criterion, and the treatment of eigenvalues at infinity. We arrive at an effective implementation which provides more accurate approximations to the eigenvalues than the methods based on the QZ algorithm. The case of polynomials having special structures, like palindromic, Hamiltonian, symplectic, etc., where the eigenvalues have special symmetries in the complex plane, is considered. A general way to adapt the Ehrlich-Aberth iteration to structured matrix polynomials is introduced. Numerical experiments which confirm the effectiveness of this approach are reported. Comment: Submitted to Linear Algebra App
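For orientation, the Ehrlich-Aberth iteration named in this abstract can be sketched for an ordinary scalar polynomial. This is only an illustrative sketch, not the paper's implementation: it does not handle matrix polynomials, eigenvalues at infinity, or structured cases. The function names (`horner`, `ehrlich_aberth`), the starting points on a circle given by a simple root bound, and the absolute-step halting test are all assumptions made for this example.

```python
import cmath

def horner(coeffs, x):
    """Return (p(x), p'(x)) for coefficients ordered from highest degree down."""
    p, dp = 0j, 0j
    for c in coeffs:
        dp = dp * x + p
        p = p * x + c
    return p, dp

def ehrlich_aberth(coeffs, tol=1e-12, max_iter=500):
    """Approximate all roots of a scalar polynomial simultaneously (hypothetical helper)."""
    n = len(coeffs) - 1
    # Assumption: start on a circle enclosing all roots (Cauchy-type bound),
    # with a small angular offset to avoid symmetric starting configurations.
    radius = 1.0 + max(abs(c) for c in coeffs[1:]) / abs(coeffs[0])
    z = [radius * cmath.exp(1j * (2.0 * cmath.pi * k / n + 0.4)) for k in range(n)]
    for _ in range(max_iter):
        largest_step = 0.0
        for i in range(n):
            p, dp = horner(coeffs, z[i])
            if p == 0:
                continue  # z[i] is already an exact root
            # Repulsion term that keeps the approximations apart.
            repulsion = sum(1.0 / (z[i] - z[j]) for j in range(n) if j != i)
            # Newton correction p/p' modified by the repulsion term,
            # written as p / (p' - p * repulsion) to avoid dividing by p'.
            step = p / (dp - p * repulsion)
            z[i] -= step
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return z

# Example: x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
roots = sorted(ehrlich_aberth([1, -6, 11, -6]), key=lambda r: r.real)
```

Each approximation is updated by a Newton step corrected with the sum of reciprocal differences to the other approximations, which is what lets all roots be refined at once without deflation.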

    Revisiting Matrix Product on Master-Worker Platforms

    This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative:
    - Centralized data. We assume that all matrix files originate from, and must be returned to, the master.
    - Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers.
    - Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK).
    We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at Ecole Normale Superieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms.

    A fast solver for linear systems with displacement structure

    We describe a fast solver for linear systems with reconstructable Cauchy-like structure, which requires O(rn^2) floating point operations and O(rn) memory locations, where n is the size of the matrix and r its displacement rank. The solver is based on the application of the generalized Schur algorithm to a suitable augmented matrix, under some assumptions on the knots of the Cauchy-like matrix. It includes various pivoting strategies, already discussed in the literature, and a new algorithm, which only requires reconstructability. We have developed a software package, written in Matlab and C-MEX, which provides a robust implementation of the above method. Our package also includes solvers for Toeplitz(+Hankel)-like and Vandermonde-like linear systems, as these structures can be reduced to Cauchy-like by fast and stable transforms. Numerical experiments demonstrate the effectiveness of the software. Comment: 27 pages, 6 figures
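To make the notion of displacement rank concrete, the following sketch builds a small Cauchy-like matrix from two n-by-r generator matrices and checks that its displacement equals their product, so the matrix is described by O(rn) numbers. It illustrates only the structure, not the solver or the generalized Schur algorithm from the abstract. The helper names, the knots `t` and `s` (assumed here to satisfy t[i] != s[j]), and the generators `G` and `B` are hypothetical values chosen for this example.

```python
def cauchy_like(t, s, G, B):
    """Build C with C[i][j] = (G[i] . B[j]) / (t[i] - s[j]); assumes t[i] != s[j]."""
    return [[sum(g * b for g, b in zip(G[i], B[j])) / (t[i] - s[j])
             for j in range(len(s))]
            for i in range(len(t))]

def displacement(C, t, s):
    """Return diag(t) * C - C * diag(s), whose (i, j) entry is (t[i] - s[j]) * C[i][j]."""
    return [[(t[i] - s[j]) * C[i][j] for j in range(len(s))]
            for i in range(len(t))]

# Hypothetical knots and rank-2 generators for a 4 x 4 example.
t = [1.0, 2.0, 3.0, 4.0]
s = [0.5, 1.5, 2.5, 3.5]
G = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
B = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

C = cauchy_like(t, s, G, B)
D = displacement(C, t, s)

# The displacement should reproduce G * B^T, a matrix of rank at most r = 2.
GBt = [[sum(g * b for g, b in zip(G[i], B[j])) for j in range(len(s))]
       for i in range(len(t))]
max_error = max(abs(D[i][j] - GBt[i][j])
                for i in range(len(t)) for j in range(len(s)))
```

With r = 1 and all generator entries equal to one, the same construction gives the classical Cauchy matrix with entries 1 / (t[i] - s[j]).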