155 research outputs found

    Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

    Full text link
    Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO

    Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+

    Get PDF
    Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware is moving towards many-core design, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores. This makes scalability analysis increasingly important. This paper presents a quantitative study for characterizing the scalability of sparse matrix–vector multiplications (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV as it is a common operation in scientific and HPC applications. Due to the newness of ARM-based many-core architectures, there is little work on understanding the SpMV scalability on such hardware design. To close the gap, we carry out a large-scale empirical evaluation involved over 1000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand what software and hardware parameters are most important for determining the scalability of a given SpMV kernel, we develop a performance analytical model based on the regression tree. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights to help application developers for better optimizing SpMV on an emerging HPC architecture

    Exploring the potential for accelerating sparse matrix-vector product on a Processing-in-Memory architecture

    Get PDF
    As the importance of memory access delays on performance has mushroomed over the past few decades, researchers have begun exploring Processing-in-Memory (PIM) technology, which offers higher memory bandwidth, lower memory latency, and lower power consumption. In this study, we investigate whether an emerging PIM design from Sandia National Laboratories can boost performance for sparse matrix-vector product (SMVP). While SMVP is in the best-case bandwidth-bound, factors related to matrix structure and representation also limit performance. We analyze SMVP both in the context of an AMD Opteron processor and the Sandia PIM, exploring the performance limiters for each and the degree to which these can be ameliorated by data and code transformations. Over a range of sparse matrices, SMVP on the PIM outperformed the Opteron by a factor of 1.82. On the PIM, computational kernel and data structure transformations improved performance by almost 40% over conventional implementations using compressed-sparse row format

    Recent Advances in Graph Partitioning

    Full text link
    We survey recent trends in practical algorithms for balanced graph partitioning together with applications and future research directions

    木を用いた構造化並列プログラミング

    Get PDF
    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered as irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are reasonably difficult. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a tool of programming. We therefore deal with abstractions of tree-based computations. Our study has started from Matsuzaki’s work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspect. Specifically, we have dealt with two issues. We first have implemented the loose coupling between skeletons and data structures and developed a flexible tree skeleton library. We secondly have implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and makes programmers to use tree skeletons with no burden. Unfortunately, the practicality of tree skeletons, however, has not been improved. On the basis of the observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen’s high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan’s formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes a lot of cache misses. The divide-and-conquer paradigm is known to be useful also for locality enhancement. We therefore have applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.電気通信大学201

    Gunrock: GPU Graph Analytics

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

    Efficient GPU-accelerated fitting of observational health-scaled stratified and time-varying Cox models

    Full text link
    The Cox proportional hazards model stands as a widely-used semi-parametric approach for survival analysis in medical research and many other fields. Numerous extensions of the Cox model have further expanded its versatility. Statistical computing challenges arise, however, when applying many of these extensions with the increasing complexity and volume of modern observational health datasets. To address these challenges, we demonstrate how to employ massive parallelization through graphics processing units (GPU) to enhance the scalability of the stratified Cox model, the Cox model with time-varying covariates, and the Cox model with time-varying coefficients. First we establish how the Cox model with time-varying coefficients can be transformed into the Cox model with time-varying covariates when using discrete time-to-event data. We then demonstrate how to recast both of these into a stratified Cox model and identify their shared computational bottleneck that results when evaluating the now segmented partial likelihood and its gradient with respect to regression coefficients at scale. These computations mirror a highly transformed segmented scan operation. While this bottleneck is not an immediately obvious target for multi-core parallelization, we convert it into an un-segmented operation to leverage the efficient many-core parallel scan algorithm. Our massively parallel implementation significantly accelerates model fitting on large-scale and high-dimensional Cox models with stratification or time-varying effect, delivering an order of magnitude speedup over traditional central processing unit-based implementations
    • …
    corecore