
123 research outputs found

CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication

Author: Liu Weifeng
Vinter Brian
Publication date: 09/04/2015

Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from CSR to CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance than the previous work. For the 10 irregular matrices, CSR5 obtains average performance improvements of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low overhead for format conversion. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5
Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15)
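
For orientation, the plain CSR layout that CSR5 is converted from can be sketched in a few lines of Python. This is not the CSR5 algorithm; it is the ordinary row-by-row CSR kernel, and the names csr_spmv, row_ptr, col_idx and vals are illustrative assumptions, not identifiers from the paper or its repository. The sketch shows why sparsity structure matters for such a kernel: the work per row equals that row's nonzero count, so irregular matrices give uneven work per row.

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] is the slice of col_idx/vals holding row i.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The length of this inner loop varies with the row's nonzero count.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (assumed for illustration), with an empty middle row:
# [[1, 0, 2],
#  [0, 0, 0],
#  [3, 4, 5]]
row_ptr = [0, 2, 2, 5]
col_idx = [0, 2, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
y = csr_spmv(row_ptr, col_idx, vals, x)
```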

arXiv.org e-Print Archive

Copenhagen University Research Information System

Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

Author: Liu Weifeng
Vinter Brian
Publication venue: 'Elsevier BV'
Publication date: 14/09/2015

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums into a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO)
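
The two-phase shape described above (compute partial sums first, place them afterwards) can be illustrated with a sequential Python sketch of SpMV written as a segmented sum. It is not the paper's algorithm: there is no GPU, no speculation and no tiling here, and treating empty rows as the thing fixed up in the second phase is an assumption made purely for illustration. The function name segmented_sum_spmv is hypothetical.

```python
def segmented_sum_spmv(row_ptr, col_idx, vals, x):
    """SpMV as (1) elementwise products, (2) segmented sum, (3) placement."""
    n_rows = len(row_ptr) - 1

    # Phase 1a: every product is independent, so this step is data-parallel.
    products = [v * x[c] for v, c in zip(vals, col_idx)]

    # Phase 1b: one partial sum per non-empty segment. Duplicate boundaries
    # (empty rows) are collapsed, so the sums are not yet in their final rows.
    bounds = sorted(set(row_ptr))
    partial = [sum(products[a:b]) for a, b in zip(bounds, bounds[1:])]

    # Phase 2: re-arrange the partial sums into the correct output rows,
    # leaving empty rows at zero.
    y = [0.0] * n_rows
    sums = iter(partial)
    for i in range(n_rows):
        if row_ptr[i + 1] > row_ptr[i]:
            y[i] = next(sums)
    return y


# Same assumed 3x3 example layout: rows 0 and 2 hold data, row 1 is empty.
y_seg = segmented_sum_spmv(
    [0, 2, 2, 5], [0, 2, 0, 1, 2], [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0]
)
```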

arXiv.org e-Print Archive

Copenhagen University Research Information System

Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

Author: Kristensen Mads Ruben Burgdorff
Vinter Brian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012

This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks to allow maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as 27 times, from a maximum of 54% to only 2% of the total execution time, in a stencil application.
Comment: PREPRINT
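
The principle of initiating communication early and waiting as late as possible can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not DistNumPy's implementation: fake_recv is a hypothetical stand-in for a network receive, threads stand in for non-blocking communication, and the dependency tracking is reduced to a fixed split between tasks that need remote data and tasks that do not.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_recv(value, delay=0.05):
    """Hypothetical stand-in for a network receive; not a DistNumPy call."""
    time.sleep(delay)
    return value


def run_with_latency_hiding(local_tasks, remote_requests, combine):
    """Start all communication, overlap it with local work, then wait."""
    with ThreadPoolExecutor() as pool:
        # 1. Aggressively initiate every communication request.
        pending = [pool.submit(fake_recv, v) for v in remote_requests]
        # 2. Run tasks with no dependency on remote data while it is in flight.
        local_results = [task() for task in local_tasks]
        # 3. Enter the wait state only when the remote data is actually needed.
        remote_results = [f.result() for f in pending]
    return combine(local_results, remote_results)


total = run_with_latency_hiding(
    local_tasks=[lambda: 1 + 1, lambda: 2 * 3],
    remote_requests=[10, 20],
    combine=lambda loc, rem: sum(loc) + sum(rem),
)
```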

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

A Distributed Virtual Machine for Microsoft .NET

Author: Larsen Morten N
Vinter Brian
Publication venue: 'Scientific Research Publishing, Inc.'
Publication date: 01/01/2012

Copenhagen University Research Information System

cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications

Author: Blum Troels
Kristensen Mads Ruben Burgdorff
Lund Simon Andreas Frimann
Vinter Brian
Publication date: 01/01/2012

Modern processor architectures, in addition to having ever more cores, also require ever more attention to memory layout in order to run at full capacity. The usefulness of most languages is diminishing as their abstractions, structures or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that enables vector-oriented high-level programming languages to map onto a broad range of architectures efficiently. The idea is to close the gap between high-level languages and hardware-optimized low-level implementations. By translating high-level vector operations into an intermediate vector bytecode, cphVB enables specialized vector engines to efficiently execute the vector operations. The primary success parameters are to maintain a complete abstraction from low-level details and to provide efficient code execution across different modern processors. We evaluate the presented design through a setup that targets multi-core CPU architectures. We evaluate the performance of the implementation using Python implementations of well-known algorithms: a Jacobi solver, a kNN search, a shallow water simulation and a synthetic stencil simulation. All demonstrate good performance.
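
The idea of recording vector operations as bytecode and handing them to a separate engine can be sketched as follows. The opcodes (LOAD, ADD, MUL), the Recorder class and the execute function are all invented for this illustration; they are not cphVB's actual instruction set or API, and the "engine" here is a naive pure-Python loop.

```python
import operator

# Assumed toy instruction set: opcode name -> elementwise Python function.
OPS = {"ADD": operator.add, "MUL": operator.mul}


class Recorder:
    """Front end: records vector operations instead of executing them."""

    def __init__(self):
        self.program = []  # instructions: (opcode, out_reg, *operands)
        self.n_regs = 0

    def _new_reg(self):
        reg = self.n_regs
        self.n_regs += 1
        return reg

    def load(self, data):
        reg = self._new_reg()
        self.program.append(("LOAD", reg, list(data)))
        return reg

    def binop(self, opcode, a, b):
        reg = self._new_reg()
        self.program.append((opcode, reg, a, b))
        return reg


def execute(program):
    """Back end ("vector engine"): interprets the recorded bytecode."""
    regs = {}
    for instr in program:
        opcode, out = instr[0], instr[1]
        if opcode == "LOAD":
            regs[out] = instr[2]
        else:
            lhs, rhs = regs[instr[2]], regs[instr[3]]
            regs[out] = [OPS[opcode](p, q) for p, q in zip(lhs, rhs)]
    return regs


rec = Recorder()
a = rec.load([1, 2, 3])
b = rec.load([10, 20, 30])
c = rec.binop("ADD", a, b)   # c = a + b
d = rec.binop("MUL", c, a)   # d = c * a
result = execute(rec.program)[d]
```

Because the front end only emits instructions, a different back end could replace execute without changing the recording code, which is the separation the abstract describes.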

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Managing Overlapping Data Structures for Data-Parallel Applications on Distributed Memory Architectures

Author: Brian Vinter
Mads Ruben Burgdorff Kristensen
Publication venue: GSTF Journal on Computing (JoC)
Publication date: 28/08/2014

In this paper, we introduce a model for managing abstract data structures that map to arbitrary distributed memory architectures. It is difficult to achieve scalable performance in data-parallel applications where the programmer manipulates abstract data structures rather than directly manipulating memory. On distributed memory architectures such abstract data-parallel operations may require communication between nodes. Therefore, the underlying system has to handle communication efficiently without any help from the user. Our data model splits data blocks into two sets -- local data and remote data -- and schedules the sub-blocks by availability at runtime. We implement the described model in DistNumPy -- a high-productivity programming library for Python. We go on to evaluate the implementation using a representative distributed memory system -- a Cray XE-6 Supercomputer -- up to 2048 cores. The benchmarking results demonstrate good, scalable performance.
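
Scheduling sub-blocks by availability can be sketched as a queue in which blocks whose data has not arrived are deferred while ready blocks are processed. This is an illustrative toy, not the DistNumPy scheduler: FakeNetwork, schedule_by_availability and the block names are hypothetical, and remote data is assumed to "arrive" after a fixed number of polls.

```python
from collections import deque


class FakeNetwork:
    """Toy availability oracle: remote blocks arrive after N polls."""

    def __init__(self, arrival_polls):
        self.remaining = dict(arrival_polls)  # block name -> polls left

    def is_available(self, block):
        if block not in self.remaining:
            return True  # local data is always available
        self.remaining[block] -= 1
        return self.remaining[block] <= 0


def schedule_by_availability(sub_blocks, is_available, process):
    """Process ready sub-blocks first; defer those still waiting on data."""
    queue = deque(sub_blocks)
    order = []
    while queue:
        block = queue.popleft()
        if is_available(block):
            process(block)
            order.append(block)
        else:
            queue.append(block)  # not ready yet: retry after the others
    return order


net = FakeNetwork({"remote_left": 2, "remote_right": 1})
processed = []
order = schedule_by_availability(
    ["remote_left", "local_0", "local_1", "remote_right"],
    net.is_available,
    processed.append,
)
```

In this run the two local sub-blocks are handled before either remote one, so the time the remote data needs to arrive is overlapped with useful work.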

GSTF Digital Library (GSTF-DL): Open Journal Systems (Global Science and Technology Forum)

PyCSP - Communicating Sequential Processes for Python

Author: Anshus Otto Johan
Bjørndalen John Markus
Vinter Brian
Publication date: 01/01/2008

Copenhagen University Research Information System

PyCSP - controlled concurrency

Author: Bjørndalen John Markus
Friborg Rune Møllegaard
Vinter Brian
Publication date: 01/01/2010

Copenhagen University Research Information System

GPAW optimized for Blue Gene/P using hybrid programming

Author: Happe Hans Henrik
Kristensen Mads Ruben Burgdorff
Vinter Brian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

Copenhagen University Research Information System

Doubling the Performance of Python/NumPy with less than 100 SLOC

Author: Kristensen Mads Ruben Burgdorff
Lund Simon AF
Skovhede Kenneth
Vinter Brian
Publication date: 26/11/2013

Copenhagen University Research Information System