6,838 research outputs found

    Revisiting Matrix Product on Master-Worker Platforms

    Get PDF
    This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative: - Centralized data. We assume that all matrix files originate from, and must be returned to, the master. - Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. - Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK). We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at Ecole Normale Superieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms

    A Case Study in Coordination Programming: Performance Evaluation of S-Net vs Intel's Concurrent Collections

    Get PDF
    We present a programming methodology and runtime performance case study comparing the declarative data flow coordination language S-Net with Intel's Concurrent Collections (CnC). As a coordination language S-Net achieves a near-complete separation of concerns between sequential software components implemented in a separate algorithmic language and their parallel orchestration in an asynchronous data flow streaming network. We investigate the merits of S-Net and CnC with the help of a relevant and non-trivial linear algebra problem: tiled Cholesky decomposition. We describe two alternative S-Net implementations of tiled Cholesky factorization and compare them with two CnC implementations, one with explicit performance tuning and one without, that have previously been used to illustrate Intel CnC. Our experiments on a 48-core machine demonstrate that S-Net manages to outperform CnC on this problem.Comment: 9 pages, 8 figures, 1 table, accepted for PLC 2014 worksho

    Strategies to optimize the LU factorization algorithm on multicore computers

    Get PDF
    The number of cores in multicore computers has an irreversible tendency to increase. Also, computers with multiple sockets to insert multicore chips are based on a complex hardware design and are becoming more common. To parallelize the algorithms that run on this type of computers in order to obtain a higher performance rate, is a goal that can only be achieved by taking into account hardware architecture. As hardware evolves, so must software. This leads to old parallelization strategies quickly become obsolete. This paper presents a series of alternatives for parallelization the LU factorization algorithm and its results intended to running on a multicore system. Simple strategies lead to poor results. This study presents complex strategies that merge double levels of parallelism with asynchronous scheduling whose results reach up to the State-of-the-art in the field and even go further.http://hpc2013.hpclatam.org/talks.html#fullpaper17Fil: Soler, Janet. Universidad Nacional de Córdoba. Facultad de Ciencias Exactas, Físicas y Naturales. Laboratorio de Computación; Argentina.Fil: Ortiz, Javier. Universidad Nacional de Córdoba. Facultad de Ciencias Exactas, Físicas y Naturales. Laboratorio de Computación; Argentina.Fil: Wolfmann, Aaron Gustavo. Universidad Nacional de Córdoba. Facultad de Ciencias Exactas, Físicas y Naturales. Laboratorio de Computación; Argentina.Ciencias de la Computació
    corecore