Search CORE

8,115 research outputs found

Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

Author: Liu Weifeng
Vinter Brian
Publication venue: 'Elsevier BV'
Publication date: 14/09/2015
Field of study

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO

arXiv.org e-Print Archive

Copenhagen University Research Information System

A NoC-based hybrid message-passing/shared-memory approach to CMP design

Author: Agarwal
Daemen
Forsell
Grecu
Karniadakis
Lorensen
Mario R. Casu
Massimo Ruo Roch
Maurizio Zamboni
Owens
Paulin
Radulescu
Sergio V. Tota
Snir
Tota
Publication venue: Elsevier
Publication date: 01/01/2011
Field of study

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor

Author: D Richie
D Richie
J Ross
JA Ross
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/08/2016
Field of study

This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM.Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologie

arXiv.org e-Print Archive

Crossref

Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays

Author: Sterling Thomas L.
Zima Hans P.
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/2000
Field of study

The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributuions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism

CiteSeerX

Caltech Authors