155 research outputs found
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores have attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
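As a rough illustration of why CSR-based SpMV can be phrased as a segmented sum, the sketch below multiplies every stored nonzero by its matching vector entry and then sums each row's contiguous segment. It is a plain sequential baseline, not the speculative CPU-GPU algorithm from the abstract; the function name and the argument names `row_ptr`, `col_idx`, and `values` are assumptions chosen for this example.

```python
def csr_spmv_segmented(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix A as products plus a segmented sum."""
    # Step 1: one product per stored nonzero (independent of all others).
    products = [v * x[j] for v, j in zip(values, col_idx)]
    # Step 2: segmented sum, where row i owns products[row_ptr[i]:row_ptr[i + 1]].
    # An empty row has an empty segment, so its sum is 0.
    return [
        sum(products[row_ptr[i]:row_ptr[i + 1]])
        for i in range(len(row_ptr) - 1)
    ]


# Hypothetical 3x3 matrix [[1, 0, 2], [0, 0, 0], [0, 3, 4]] in CSR form.
y = csr_spmv_segmented([0, 2, 2, 4], [0, 2, 1, 2], [1, 2, 3, 4], [1, 2, 3])
```

The empty middle row shows the awkward case for parallel versions: segments have irregular lengths, and some contain nothing at all.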
Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+
Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores. This makes scalability analysis increasingly important. This paper presents a quantitative study characterizing the scalability of sparse matrix–vector multiplications (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV as it is a common operation in scientific and HPC applications. Due to the newness of ARM-based many-core architectures, there is little work on understanding SpMV scalability on such hardware designs. To close the gap, we carry out a large-scale empirical evaluation involving over 1000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand which software and hardware parameters are most important for determining the scalability of a given SpMV kernel, we develop an analytical performance model based on a regression tree. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights to help application developers better optimize SpMV on an emerging HPC architecture
Exploring the potential for accelerating sparse matrix-vector product on a Processing-in-Memory architecture
As the importance of memory access delays on performance has mushroomed over the past few decades, researchers have begun exploring Processing-in-Memory (PIM) technology, which offers higher memory bandwidth, lower memory latency, and lower power consumption. In this study, we investigate whether an emerging PIM design from Sandia National Laboratories can boost performance for sparse matrix-vector product (SMVP). While SMVP is, in the best case, bandwidth-bound, factors related to matrix structure and representation also limit performance. We analyze SMVP in the context of both an AMD Opteron processor and the Sandia PIM, exploring the performance limiters for each and the degree to which these can be ameliorated by data and code transformations. Over a range of sparse matrices, SMVP on the PIM outperformed the Opteron by a factor of 1.82. On the PIM, computational kernel and data structure transformations improved performance by almost 40% over conventional implementations using compressed sparse row format
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
Structured Parallel Programming Using Trees
High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are understandably difficult to handle. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspect. Specifically, we have dealt with two issues. First, we implemented a loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use tree skeletons with no burden. Unfortunately, the practicality of tree skeletons has not improved. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. 
Specifically, we have dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is also known to be useful for locality enhancement. We have therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.
University of Electro-Communications 201
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
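To make the idea of bulk-synchronous operations on a vertex frontier concrete, here is a minimal sequential sketch of breadth-first search written as repeated frontier expansion. It does not use Gunrock's API; the function name, the dictionary-based adjacency list, and the "expand" and "filter" labels in the comments are assumptions made for illustration only.

```python
def bfs_frontier(adjacency, source):
    """Return the BFS depth of every vertex reachable from source."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # Expand: visit the neighbors of every vertex in the current frontier.
        for u in frontier:
            for v in adjacency.get(u, ()):
                # Filter: keep only vertices that have not been seen yet.
                if v not in depth:
                    depth[v] = level
                    next_frontier.append(v)
        # One bulk step is complete; the new frontier replaces the old one.
        frontier = next_frontier
    return depth


# Hypothetical 4-vertex graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
depths = bfs_frontier({0: [1, 2], 1: [3], 2: [3], 3: []}, 0)
```

Each pass of the outer loop is one bulk step over the whole frontier, which is the unit a GPU implementation would parallelize.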
Efficient GPU-accelerated fitting of observational health-scaled stratified and time-varying Cox models
The Cox proportional hazards model stands as a widely-used semi-parametric
approach for survival analysis in medical research and many other fields.
Numerous extensions of the Cox model have further expanded its versatility.
Statistical computing challenges arise, however, when applying many of these
extensions with the increasing complexity and volume of modern observational
health datasets. To address these challenges, we demonstrate how to employ
massive parallelization through graphics processing units (GPU) to enhance the
scalability of the stratified Cox model, the Cox model with time-varying
covariates, and the Cox model with time-varying coefficients. First we
establish how the Cox model with time-varying coefficients can be transformed
into the Cox model with time-varying covariates when using discrete
time-to-event data. We then demonstrate how to recast both of these into a
stratified Cox model and identify their shared computational bottleneck that
results when evaluating the now segmented partial likelihood and its gradient
with respect to regression coefficients at scale. These computations mirror a
highly transformed segmented scan operation. While this bottleneck is not an
immediately obvious target for multi-core parallelization, we convert it into
an un-segmented operation to leverage the efficient many-core parallel scan
algorithm. Our massively parallel implementation significantly accelerates
model fitting on large-scale and high-dimensional Cox models with
stratification or time-varying effect, delivering an order of magnitude speedup
over traditional central processing unit-based implementations
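One textbook way to turn a segmented scan into an un-segmented one is to scan (head flag, value) pairs with a single associative operator, sketched below. This is a generic illustration and not necessarily the transformation the authors use; `combine` and `segmented_inclusive_scan` are hypothetical names, and `itertools.accumulate` runs sequentially, standing in for a parallel scan.

```python
from itertools import accumulate


def combine(left, right):
    """Associative operator on (head_flag, value) pairs."""
    left_flag, left_value = left
    right_flag, right_value = right
    # A head flag on the right element starts a new segment, so the
    # running total from the left is discarded; otherwise it is added.
    value = right_value if right_flag else left_value + right_value
    return (left_flag or right_flag, value)


def segmented_inclusive_scan(values, head_flags):
    """Inclusive prefix sums that restart wherever head_flags is True."""
    pairs = zip(head_flags, values)
    # An ordinary (un-segmented) scan over the pairs gives the segmented result.
    return [value for _, value in accumulate(pairs, combine)]


# Two segments: [1, 2] and [3, 4, 5].
result = segmented_inclusive_scan([1, 2, 3, 4, 5], [True, False, True, False, False])
```

Because `combine` is associative, any standard parallel scan can be substituted for `accumulate` without changing the result.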
Parallelisation of greedy algorithms for compressive sensing reconstruction
Compressive Sensing (CS) is a technique which allows a signal to be compressed at the same
time as it is captured. The process of capturing and simultaneously compressing the signal is
represented as linear sampling, which can encompass a variety of physical processes or signal
processing. Instead of explicitly identifying redundancies in the source signal, CS relies on the
property of sparsity in order to reconstruct the compressed signal. While linear sampling is
much less burdensome than conventional compression, this is more than made up for by the high
computational cost of reconstructing a signal which has been captured using CS. Even when
using some of the fastest reconstruction techniques, known as greedy pursuits, reconstruction
of large problems can pose a significant burden, consuming a great deal of memory as well as
compute time.
Parallel computing is the foundation of the field of High Performance Computing (HPC).
Modern supercomputers are generally composed of large clusters of standard servers, with a
dedicated low-latency high-bandwidth interconnect network. On such a cluster, an appropriately
written program can harness vast quantities of memory and computational power. However, in
order to exploit a parallel compute resource, an algorithm usually has to be redesigned from
the ground up. In this thesis I describe the development of parallel variants of two algorithms
commonly used in CS reconstruction, Matching Pursuit (MP) and Orthogonal Matching Pursuit
(OMP), resulting in the new distributed compute algorithms DistMP and DistOMP. I present
the results from experiments showing how DistMP and DistOMP can utilise a compute cluster
to solve CS problems much more quickly than a single computer could alone. Speed-up of as
much as a factor of 76 is observed with DistMP when utilising 210 workers across 14 servers,
compared to a single worker. Finally, I demonstrate how DistOMP can solve a problem with a
429GB equivalent sampling matrix in as little as 62 minutes using a 16-node compute cluster.
Funded by an ICASE award from the Engineering and Physical Sciences Research Council, with sponsorship provided by Thales Research and Technology