4,934 research outputs found
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate a possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
We explore the trade-offs of performing linear algebra using Apache Spark,
compared to traditional C and MPI implementations on HPC platforms. Spark is
designed for data analytics on cluster computing platforms with access to local
disks and is optimized for data-parallel tasks. We examine three widely-used
and important matrix factorizations: NMF (for physical plausability), PCA (for
its ubiquity) and CX (for data interpretability). We apply these methods to
TB-sized problems in particle physics, climate modeling and bioimaging. The
data matrices are tall-and-skinny which enable the algorithms to map
conveniently into Spark's data-parallel model. We perform scaling experiments
on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide
tuning guidance to obtain high performance
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
Iteration-fusing conjugate gradient
This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach which is an evolution of the Conjugate Gradient method that consists in i) letting computations from different iterations to overlap between them and ii) splitting linear algebra kernels into subkernels to increase concurrency and relax data-dependencies. The paper presents two ways of applying the IFCG approach: The IFCG1 algorithm, which aims at hiding the cost of parallel reductions, and the IFCG2 algorithm, which aims at reducing idle time by starting computations as soon as possible. Both IFCG1 and IFCG2 algorithms are two complementary approaches aiming at increasing parallel performance. Extensive numerical experiments are conducted to compare the IFCG1 and IFCG2 numerical stability and performance against four state-of-the-art techniques. By considering a set of representative input matrices, the paper demonstrates that IFCG1 and IFCG2 provide parallel performance improvements up to 42.9% and 41.5% respectively and average improvements of 11.8% and 7.1% with respect to the best state-of-the-art techniques while keeping similar numerical stability properties. Also, this paper provides an evaluation of the IFCG algorithms' sensitivity to system noise and it demonstrates that they run 18.0% faster on average than the best state-of-the-art technique under realistic degrees of system noise.This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316) , by Generalitat de Catalunya
(contracts 2014-SGR-1051 and 2014-SGR-1272) and by the IBM/BSC Deep Learning Center Initiative.Peer ReviewedPostprint (author's final draft
- …