
    Towards an auto-tuned and task-based SpMV (LASs Library)

    We present a novel approach to parallelizing the SpMV kernel included in the LASs (Linear Algebra routines on OmpSs) library, after a thorough review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Building on tasking and nesting, and with the aim of mitigating the workload imbalance inherent to the SpMV operation, we present a strategy that is especially useful for highly imbalanced input matrices. In this approach, the number of tasks created is decided dynamically in order to maximize the use of the platform's resources. Throughout this paper, the behavior of SpMV under each strategy (state-of-the-art and proposed) is analyzed in depth, laying the groundwork for a future auto-tunable code able to select the most suitable approach for a given input matrix. The experiments were carried out on a set of 12 matrices from the SuiteSparse Matrix Collection, all with different sparsity characteristics, on a node of the MareNostrum 4 supercomputer (two Intel Xeon sockets, 24 cores each) and on a node of the Dibona cluster (one ARM ThunderX2 socket with 32 cores). Our tests show that, on Intel Xeon, the best parallelization strategy reduces the execution time of the reference multi-threaded MKL version by up to 67%; on ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference. This project has received funding from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051), the Juan de la Cierva Grant Agreement No IJCI-2017-33511, and the Spanish Ministry of Science and Innovation under the project Heterogeneidad y especialización en la era post-Moore (RTI2018-093684-B-I00). We also acknowledge the funding provided by Fujitsu under the BSC-Fujitsu joint project: Math Libraries Migration and Optimization. Peer Reviewed. Postprint (author's final draft).
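The nonzero-balanced, dynamic task creation described above can be sketched with plain OpenMP tasks. This is a simplified stand-in for the OmpSs tasking used in LASs, not the library's code: the `spmv_tasks` interface and the block-splitting heuristic are illustrative assumptions.

```c
/* Hypothetical sketch (not the LASs implementation): CSR SpMV split into
 * row-block tasks whose boundaries are chosen so that each task covers
 * roughly the same number of nonzeros, mimicking the imbalance-aware
 * task creation described in the abstract.  Compiled without OpenMP,
 * the pragmas are ignored and the code runs serially with the same result. */
static void spmv_tasks(int n, const int *rowptr, const int *colidx,
                       const double *val, const double *x, double *y,
                       int ntasks)
{
    int nnz = rowptr[n];
    int target = (nnz + ntasks - 1) / ntasks;   /* nonzeros per task */

    #pragma omp parallel
    #pragma omp single
    {
        int row = 0;
        while (row < n) {
            int first = row;
            /* grow the row block until it holds ~target nonzeros */
            while (row < n && rowptr[row + 1] - rowptr[first] < target)
                row++;
            if (row < n) row++;
            int last = row;
            #pragma omp task firstprivate(first, last) shared(y)
            for (int i = first; i < last; i++) {
                double acc = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                    acc += val[k] * x[colidx[k]];
                y[i] = acc;
            }
        }
        #pragma omp taskwait
    }
}
```

A highly imbalanced matrix thus yields many small single-row tasks for its dense rows and a few wide tasks for its empty regions, instead of one task per fixed-size row block.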

    A new generation of task-parallel algorithms for matrix inversion in many-threaded CPUs

    We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms partition the matrix operand into two levels of tasks: the matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general-purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores. This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and project PR65/19-22445 of Universidad Complutense de Madrid. Peer Reviewed. Postprint (author's final draft).
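As a rough illustration of the tile level of this partitioning, the sketch below performs an in-place Gauss-Jordan inversion in which, at every step, the row updates are split into row-block tasks. It is only the tasking pattern the abstract builds on, not the paper's algorithm: the panel level and partial pivoting are omitted for brevity, so it assumes nonzero pivots.

```c
/* Illustrative sketch: in-place Gauss-Jordan inversion where the update
 * of the non-pivot rows at each step k is partitioned into row-block
 * (tile) tasks.  No pivoting, so the matrix is assumed to have nonzero
 * pivots.  Without OpenMP the pragmas are ignored and the result is the
 * same.  A production code would hoist the parallel region out of the
 * k loop; it is kept inside here to keep the sketch short. */
static void gauss_jordan_inverse(int n, double *A, int tile)
{
    for (int k = 0; k < n; k++) {
        double d = A[k * n + k];
        A[k * n + k] = 1.0;
        for (int j = 0; j < n; j++)
            A[k * n + j] /= d;                 /* scale the pivot row */

        #pragma omp parallel
        #pragma omp single
        {
            for (int i0 = 0; i0 < n; i0 += tile) {
                int i1 = i0 + tile < n ? i0 + tile : n;
                #pragma omp task firstprivate(i0, i1)
                for (int i = i0; i < i1; i++) {
                    if (i == k) continue;      /* pivot row already done */
                    double m = A[i * n + k];
                    A[i * n + k] = 0.0;
                    for (int j = 0; j < n; j++)
                        A[i * n + j] -= m * A[k * n + j];
                }
            }
            #pragma omp taskwait
        }
    }
}
```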

    Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

    We propose a methodology to address the programmability issues arising from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya; and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, and Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe. Peer Reviewed. Postprint (published version).
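The core-to-data binding idea above relies on first-touch page placement: the same static partition of the data into per-domain chunks is used both to initialize the data and to compute on it, so each page is faulted in near the cores that will use it. The sketch below is a generic illustration of that pattern, not the paper's DMFI code; the domain count and the chunk mapping are assumptions.

```c
/* First-touch sketch: phase 1 initializes one contiguous chunk per NUMA
 * domain in parallel (so pages land in the memory of the domain that
 * touches them), and phase 2 computes with the identical chunk-to-thread
 * mapping.  On a non-NUMA machine, or without OpenMP, the code simply
 * runs correctly without the locality benefit. */
static double sum_multidomain(double *x, int n, int ndom)
{
    /* phase 1: first-touch initialization, one chunk per domain */
    #pragma omp parallel for schedule(static)
    for (int d = 0; d < ndom; d++) {
        int lo = (int)((long)n * d / ndom);
        int hi = (int)((long)n * (d + 1) / ndom);
        for (int i = lo; i < hi; i++)
            x[i] = 1.0;                        /* pages faulted in locally */
    }

    /* phase 2: compute, reusing the same static chunk mapping */
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int d = 0; d < ndom; d++) {
        int lo = (int)((long)n * d / ndom);
        int hi = (int)((long)n * (d + 1) / ndom);
        for (int i = lo; i < hi; i++)
            total += x[i];
    }
    return total;
}
```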

    Ahora / Ara

    The fifth edition of the microstory collection for the eradication of violence against women, published by the Institut Universitari d’Estudis Feministes i de Gènere «Purificación Escribano» of the Universitat Jaume I, is intended as a declaration of hope. This is the moment at which women (and men) must take a step forward and eliminate systemic violence against women. Now is the time to denounce sexism, large and small, and to begin building a more egalitarian society. Each of the stories in the book is both a denunciation and a declaration that leads us toward a better world.

    Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors

    This dissertation targets two important problems. The first is the design of low-level DLA kernels for architectures comprising two (or more) classes of cores. The main question we address here is how to attain a balanced distribution of the computational workload among the heterogeneous cores while taking into account that some of the resources, in particular cache levels, are either shared or private. The second question is partially related to the first. Concretely, this dissertation explores an alternative to runtime-based systems in order to extract “sufficient” parallelism from complex DLA operations while making efficient use of the cache hierarchy of the architecture. Thus, the main goal of this thesis is the study, design, development, and analysis of experimental, architecture-aware solutions for the execution of DLA operations on low-power architectures, more specifically asymmetric platforms. Programa de Doctorat en Informàtica
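The balanced-distribution question posed above admits a simple static answer when the relative speeds of the core classes are known: split the iteration space so that both clusters finish at the same time. The sketch below illustrates that back-of-the-envelope calculation; the speed ratios and the `asymmetric_cut` helper are illustrative assumptions, not the dissertation's kernels.

```c
/* Static workload split for an asymmetric multicore: given nb "big"
 * cores of relative speed sb and nl "LITTLE" cores of speed sl, return
 * how many of n iterations the big cluster should take so that
 * cut / (nb*sb) == (n - cut) / (nl*sl), i.e. both clusters finish
 * simultaneously.  Rounded to the nearest integer. */
static int asymmetric_cut(int n, int nb, double sb, int nl, double sl)
{
    double big = nb * sb;
    double little = nl * sl;
    return (int)(n * big / (big + little) + 0.5);
}
```

For example, with four big cores twice as fast as four LITTLE cores, the big cluster should take two thirds of the iterations. Shared caches complicate this static picture, which is precisely the refinement the dissertation studies.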

    BLAS-3 optimized by OmpSs regions (LASs library)

    In this paper we propose a set of optimizations for the BLAS-3 routines of the LASs library (Linear Algebra routines on OmpSs) and perform a detailed analysis of the impact of the proposed changes on performance and execution time. OmpSs allows regions to be used in the dependences of tasks. This helps not only in programming the algorithmic optimizations, but also in reducing the execution time achieved by such optimizations. Different strategies are implemented in order to reduce the number of tasks created (when there is enough parallelism) during the execution of BLAS-3 operations in the original LASs. A better IPC is also obtained thanks to better exploitation of the memory hierarchy. More specifically, we increase performance, particularly on big matrices, by about 12% for TRSM and 17% for GEMM with respect to the original version of LASs, even using fewer cores in the case of GEMM/SYMM. Moreover, when LASs is compared to PLASMA, the OpenMP reference dense linear algebra library, performance increases by up to 12.5% for GEMM/SYMM, while for TRSM/TRMM this value rises to 15%. This project has received funding from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051), and the Juan de la Cierva Grant Agreement No IJCI-2017-33511. We also acknowledge the funding provided by Fujitsu under the BSC-Fujitsu joint project: Math Libraries Migration and Optimization. Peer Reviewed.
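The tiled BLAS-3 pattern underlying this work can be sketched as one task per output tile. This is a generic illustration, not LASs itself: with OmpSs, the task dependences could name whole tile regions, which is approximated here with an independent OpenMP task per tile of C.

```c
/* Sketch of a tiled GEMM, C += A*B, for square n x n row-major matrices:
 * the output is decomposed into bs x bs tiles and one task computes each
 * tile.  Tiles of C are disjoint, so no dependences between tasks are
 * needed; the implicit barrier at the end of the single construct waits
 * for all of them.  Without OpenMP the loops run serially, same result. */
static void gemm_tiled(int n, int bs, const double *A, const double *B,
                       double *C)
{
    #pragma omp parallel
    #pragma omp single
    for (int i0 = 0; i0 < n; i0 += bs)
        for (int j0 = 0; j0 < n; j0 += bs) {
            #pragma omp task firstprivate(i0, j0)
            {
                int i1 = i0 + bs < n ? i0 + bs : n;
                int j1 = j0 + bs < n ? j0 + bs : n;
                for (int i = i0; i < i1; i++)
                    for (int j = j0; j < j1; j++) {
                        double acc = C[i * n + j];
                        for (int k = 0; k < n; k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
            }
        }
}
```

Reducing the number of tasks, as the paper does, corresponds here to growing `bs` when the matrix offers more parallelism than the cores can consume.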

    Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures

    We investigate several parallel algorithmic variants of the LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic to the basic linear algebra kernels that appear during the factorization, but exerts strict control over the cache memory and a static mapping of kernels to cores. A second variant relaxes this task-constrained scenario by introducing a look-ahead of depth one to increase task parallelism, at the cost of more pressure on the cache system in terms of cache misses. Finally, the third variant orchestrates an execution where the degree of concurrency is limited only by the actual data dependencies in LUpp, potentially leading to a higher volume of conflicts due to competition for the cache memory resources. The target platform for our implementations and experiments is a specific asymmetric multicore processor (AMP) from ARM, which introduces the additional scheduling complexity of having to deal with two distinct types of cores, and an L2 cache shared per cluster of the AMP, which results in more contention in the access to this key cache level. The researchers from Universidad Jaume I were supported by project TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universitat Politecnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. d’Innovacio, Universitats i Empresa. Peer Reviewed. Postprint (published version).
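The third, dependency-driven variant can be sketched with OpenMP `task depend` clauses. This is a toy model of the scheduling only, not the paper's LUpp code: "panels" are single columns, pivoting is omitted, and one element per column serves as a dependence sentinel. Because the panel task of step k+1 depends only on the update of column k+1, the runtime may execute it ahead of the remaining updates of step k, which is exactly how look-ahead of arbitrary depth emerges from the dependences.

```c
/* Dependence-driven LU factorization (Doolittle, no pivoting) of an
 * n x n row-major matrix, L and U stored in place.  A[j] (one element
 * per column) is used purely as a dependence sentinel for column j.
 * Compiled without OpenMP, the pragmas are ignored and the tasks run
 * in program order, producing the same factorization. */
static void lu_tasks(int n, double *A)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < n; k++) {
        /* "panel" task: scale the subdiagonal of column k */
        #pragma omp task depend(inout: A[k])
        for (int i = k + 1; i < n; i++)
            A[i * n + k] /= A[k * n + k];

        /* trailing update, one task per remaining column */
        for (int j = k + 1; j < n; j++) {
            #pragma omp task depend(in: A[k]) depend(inout: A[j])
            for (int i = k + 1; i < n; i++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }
}
```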

    Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors

    We analyze the benefits of look-ahead in the parallel execution of the LU factorization with partial pivoting (LUpp) in two distinct “asymmetric” multicore scenarios. The first corresponds to an actual hardware-asymmetric architecture such as the Samsung Exynos 5422 system-on-chip (SoC), equipped with an ARM big.LITTLE processor consisting of a quad-core Cortex-A15 cluster plus a quad-core Cortex-A7 cluster. For this scenario, we propose a careful mapping of the different types of tasks appearing in LUpp to the computational resources, in order to produce an efficient, architecture-aware exploitation of the computational resources integrated in this SoC. The second asymmetric configuration appears in a hardware-symmetric multicore architecture where the cores can individually operate at different frequency levels. In this scenario, we show how to employ the frequency slack to accelerate the tasks on the critical path of LUpp in order to produce a faster global execution as well as lower energy consumption. Peer Reviewed.

    sLASs: a fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

    © 2019 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/. In this work we have implemented a novel linear algebra library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features: weak dependencies and regions, together with the final clause, for the implementation of auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library of auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features yields an improvement in execution time over reference libraries such as the original LASs library, PLASMA, ATLAS, and Intel MKL. These codes are able to reduce the execution time by about 18% on big matrices, by increasing the IPC of gemm and reducing the time of task instantiation. For a few medium matrices, benefits are also seen. For small matrices and a subset of medium matrices, specific optimizations are applied that increase the degree of parallelism in both gemm and trsm tasks. This strategy achieves a performance improvement of up to 40%. This project has received funding from the Spanish Ministry of Economy and Competitiveness, Spain, under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, Spain, under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051), and the Juan de la Cierva Grant Agreement No IJCI-2017-33511. We also acknowledge the funding provided by Fujitsu, Japan, under the BSC-Fujitsu joint project: Math Libraries Migration and Optimization. Peer Reviewed. Postprint (author's final draft).
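The auto-tuning role of the final clause mentioned above can be illustrated with a minimal recursive kernel: below an auto-tunable size threshold, `final` stops further task creation and the remaining work runs serially inside the current task. The kernel, the threshold, and the recursion shape are illustrative assumptions, not sLASs code.

```c
/* Cutoff sketch in the spirit of the OpenMP/OmpSs `final` clause: a
 * recursive sum spawns tasks only while the subproblem is larger than
 * `cutoff`; the final clause marks the last spawned generation so its
 * descendants execute as included (serial) tasks.  The cutoff is the
 * kind of parameter an auto-tuner would select per matrix size.
 * Without OpenMP the pragmas are ignored and the recursion is serial. */
static double sum_rec(const double *x, int n, int cutoff)
{
    if (n <= cutoff) {                  /* serial leaf */
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }
    double s1 = 0.0, s2;
    #pragma omp task shared(s1) final(n / 2 <= cutoff)
    s1 = sum_rec(x, n / 2, cutoff);
    s2 = sum_rec(x + n / 2, n - n / 2, cutoff);
    #pragma omp taskwait                /* wait for the spawned half */
    return s1 + s2;
}
```

In practice the top-level call would sit inside a parallel/single region; called outside one, the tasks simply execute immediately, which keeps the sketch correct either way.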