57 research outputs found

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among processors, together with structural diversity among sparse matrices, leads to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform, either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup are required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library.
    Comment: 10 pages, 7 figures, ICPP 201
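
    For reference, the CSR layout that the optimization pool builds on stores a matrix as three arrays: row pointers, column indices, and non-zero values. The sketch below is a plain, unoptimized CSR SpMV baseline in C; the array and function names are illustrative rather than taken from the paper, and the optimizer described above would specialize such a kernel per matrix and platform.

```c
/* Minimal CSR SpMV baseline sketch (illustrative names, not the paper's code). */
#include <stdio.h>

/* y = A*x for an m-row sparse matrix stored in CSR. */
static void spmv_csr(int m, const int *row_ptr, const int *col_idx,
                     const double *values, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [[2 0 1], [0 3 0], [4 0 5]] */
    int row_ptr[] = {0, 2, 3, 5};
    int col_idx[] = {0, 2, 1, 0, 2};
    double values[] = {2, 1, 3, 4, 5};
    double x[] = {1, 1, 1}, y[3];

    spmv_csr(3, row_ptr, col_idx, values, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```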

    CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs

    Graphics Processing Units (GPUs) are well established in HPC systems and frequently used to accelerate linear algebra routines. Since data transfers pose a severe bottleneck for GPU offloading, modern GPUs provide the ability to overlap communication with computation by splitting the problem into fine-grained sub-kernels that are executed in a pipelined manner. This optimization is currently underutilized by GPU BLAS libraries, since it requires selecting an efficient tiling size, a challenging problem that must take routine-, system-, data-, and problem-specific characteristics into account. In this work, we introduce an elaborate 3-way concurrency model for GPU BLAS offload time that considers previously neglected features regarding data access and machine behavior. We then incorporate our model in an automated, end-to-end framework (called CoCoPeLia) that supports overlap prediction, tile selection and effective tile scheduling. We validate our model's efficacy for dgemm, sgemm, and daxpy on two testbeds, with our experimental results showing that it achieves significantly lower prediction error than previous models and provides near-optimal tiling sizes for all problems. We also demonstrate that CoCoPeLia leads to considerable performance improvements compared to state-of-the-art BLAS routine implementations for GPUs.
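
    The tiling trade-off can be seen with the classic pipeline bound: the first tile pays for all three stages (host-to-device copy, execution, device-to-host copy), while the remaining tiles hide behind the slowest stage. The sketch below uses only this simplified bound with placeholder timings; CoCoPeLia's actual model additionally accounts for data access and machine behavior, which this sketch omits.

```c
/* Simplified pipelined-overlap time model for a GPU offload split into tiles.
 * This is the textbook pipeline bound, not CoCoPeLia's full model, and the
 * stage timings below are placeholders. */
#include <stdio.h>

static double max3(double a, double b, double c)
{
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* Predicted offload time: the first tile pays all three stages, the
 * remaining tiles are hidden behind the slowest stage. */
static double overlap_time(double t_h2d, double t_exec, double t_d2h, int n_tiles)
{
    return t_h2d + t_exec + t_d2h + (n_tiles - 1) * max3(t_h2d, t_exec, t_d2h);
}

int main(void)
{
    double total_h2d = 0.40, total_exec = 0.60, total_d2h = 0.20; /* seconds */

    /* Scan candidate tile counts; more tiles mean better overlap but, in a
     * fuller model, also more per-tile launch and transfer overhead. */
    for (int n = 1; n <= 32; n *= 2) {
        double t = overlap_time(total_h2d / n, total_exec / n, total_d2h / n, n);
        printf("tiles=%2d  predicted time=%.3f s\n", n, t);
    }
    return 0;
}
```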

    Modeling the Scalability of the EuroExa Reconfigurable Accelerators - Preliminary Results

    Current technology and application trends push for both performance and power efficiency. EuroEXA is a project that pursues these goals, aiming to push performance towards the exascale level. Towards this objective, EuroEXA nodes integrate reconfigurable (FPGA) accelerators to offload computationally intensive workloads. To fully utilize the FPGA's resource pool, multiple accelerators must be instantiated. System design and dimensioning require an early performance estimate to evaluate different design options, such as using larger FPGA devices or instantiating a larger number of accelerator instances. In this paper, we present the preliminary results of modeling the scalability of the EuroEXA reconfigurable accelerators in the FPGA fabric. We start by using simple equations to bound the total number of kernels that can work in parallel, depending on the available memory channels and reconfigurable resources. Then, we use a second-degree polynomial model to predict the performance benefits of instantiating multiple replicated kernels in an FPGA. The model suggests whether switching to a larger FPGA is an advantageous choice in terms of performance. We verify our results using micro-benchmarks on two state-of-the-art FPGAs: the Alveo U50 and the Alveo U280.
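
    As a rough illustration of the two modeling steps, the sketch below first bounds the number of kernel instances by the available reconfigurable resources and memory channels, and then evaluates a second-degree polynomial performance model. All coefficients and resource figures are invented placeholders, not values from the paper.

```c
/* Hedged sketch of (1) an instance-count bound and (2) a 2nd-degree
 * polynomial performance model; all numbers are placeholders. */
#include <stdio.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Upper bound on parallel kernel instances from resources and channels. */
static int max_instances(int fpga_luts, int luts_per_kernel,
                         int mem_channels, int channels_per_kernel)
{
    return min_int(fpga_luts / luts_per_kernel, mem_channels / channels_per_kernel);
}

/* perf(k) = a + b*k + c*k^2, with c < 0 capturing diminishing returns. */
static double perf_model(double a, double b, double c, int k)
{
    return a + b * k + c * (double)k * k;
}

int main(void)
{
    /* Hypothetical "small" vs "large" FPGA. */
    int k_small = max_instances(870000, 60000, 2, 1);
    int k_large = max_instances(1300000, 60000, 32, 1);

    double p_small = perf_model(0.0, 1.0, -0.03, k_small);
    double p_large = perf_model(0.0, 1.0, -0.03, k_large);

    printf("small FPGA: up to %d instances, modeled perf %.2f\n", k_small, p_small);
    printf("large FPGA: up to %d instances, modeled perf %.2f\n", k_large, p_large);
    return 0;
}
```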

    Adapt or Become Extinct!: The Case for a Unified Framework for Deployment-Time Optimization

    The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, and accelerators. This environment also presents a number of hard boundaries (walls) for applications, which limit software development (parallel programming wall), performance (memory wall, communication wall) and viability (power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be utilized to adapt the execution context, and may lead to performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information such as user input, architectural characteristics like the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One of the challenges for future execution platforms is to allow the seamless integration of these various kinds of information with information obtained from static analysis (during either ahead-of-time or just-in-time compilation). We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life-cycle of an application.
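
    A minimal, self-contained illustration of deployment-time adaptation (not the framework proposed in the paper): measure two functionally equivalent code variants on the target machine at startup and keep the faster one for the remainder of the run. The variants and problem size below are invented for illustration.

```c
/* Toy deployment-time adaptation: pick the faster of two equivalent variants. */
#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N], x[N], y[N];

/* Row-major traversal: cache-friendly on this layout. */
static void mv_rowwise(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
}

/* Column-major traversal: same result, different memory behavior. */
static void mv_colwise(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            y[i] += a[i][j] * x[j];
}

static double time_variant(void (*f)(void))
{
    clock_t start = clock();
    f();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    double t_row = time_variant(mv_rowwise);
    double t_col = time_variant(mv_colwise);

    void (*chosen)(void) = t_row <= t_col ? mv_rowwise : mv_colwise;
    printf("row: %.3fs  col: %.3fs  -> using %s variant\n",
           t_row, t_col, chosen == mv_rowwise ? "row-wise" : "column-wise");
    /* Subsequent iterations would call chosen() directly. */
    return 0;
}
```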

    PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks as well as problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state of the art on a large and diverse dataset, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
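
    The device-selection decision can be sketched with a toy performance model: predict the time of a GEMM-like call for each candidate GPU count, assuming compute is split across devices while transfers share one link, and pick the configuration with the lowest prediction. The throughput and bandwidth figures below are placeholders, and the model is far simpler than PARALiA's.

```c
/* Toy model-based device selection for a GEMM-like call; numbers are placeholders. */
#include <stdio.h>

struct config { int n_gpus; double pred_time; };

static double predict_time(double flops, double bytes,
                           double gpu_flops_per_s, double link_bytes_per_s,
                           int n_gpus)
{
    double t_exec = flops / (gpu_flops_per_s * n_gpus);   /* compute split across GPUs */
    double t_xfer = bytes / link_bytes_per_s;              /* transfers on a shared link */
    return t_exec > t_xfer ? t_exec : t_xfer;              /* assume full overlap        */
}

int main(void)
{
    double m = 8192, n = 8192, k = 8192;
    double flops = 2.0 * m * n * k;
    double bytes = 8.0 * (m * k + k * n + m * n);

    struct config best = {0, 1e30};
    for (int g = 1; g <= 8; g++) {
        double t = predict_time(flops, bytes, 7e12, 12e9, g);
        printf("%d GPU(s): predicted %.3f s\n", g, t);
        if (t < best.pred_time) { best.n_gpus = g; best.pred_time = t; }
    }
    printf("selected %d GPU(s)\n", best.n_gpus);
    return 0;
}
```

    With these placeholder numbers the prediction becomes transfer-bound beyond a couple of GPUs, which is exactly the kind of case where adding more devices stops paying off and a performance-aware selection matters.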

    SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

    Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).
    Comment: To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27)
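
    The hierarchical idea behind SynCron's communication protocol can be illustrated in software with a two-level barrier: cores first synchronize within their NDP unit, and only one representative per unit takes part in the global step, so cross-unit traffic scales with the number of units rather than the number of cores. The C sketch below is a shared-memory analogue for illustration only, not the paper's hardware mechanism.

```c
/* Software analogue of a hierarchical (two-level) barrier; compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define UNITS 4
#define CORES_PER_UNIT 4

static atomic_int local_count[UNITS];  /* arrivals inside each unit */
static atomic_int global_count;        /* arrivals of unit leaders  */
static atomic_int generation;          /* release flag for everyone */

static void hierarchical_barrier(int unit)
{
    int gen = atomic_load(&generation);

    /* Local step: the last core to arrive in the unit becomes its leader. */
    if (atomic_fetch_add(&local_count[unit], 1) + 1 == CORES_PER_UNIT) {
        atomic_store(&local_count[unit], 0);
        /* Global step: the last leader releases all waiting cores. */
        if (atomic_fetch_add(&global_count, 1) + 1 == UNITS) {
            atomic_store(&global_count, 0);
            atomic_fetch_add(&generation, 1);
        }
    }
    while (atomic_load(&generation) == gen)
        ;  /* spin until released */
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    hierarchical_barrier(id / CORES_PER_UNIT);
    printf("core %d passed the barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[UNITS * CORES_PER_UNIT];
    for (long i = 0; i < UNITS * CORES_PER_UNIT; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < UNITS * CORES_PER_UNIT; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```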