Importance of Explicit Vectorization for CPU and GPU Software Performance

Author: Allen
Anderson
Berg
Eichenberger
Firas Hamze
Kamran Karimi
Kirk
Knuth
Marsaglia
Matsumoto
Metropolis
Neil G. Dickson
Owens
Preis
Samant
Scott
Suzuki
Tomov
Publication venue: Elsevier BV
Publication date: 31/03/2010

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be applied in addition to these, are less frequently discussed. In this paper, we present an analysis of several optimizations applied to both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and its GPU equivalent, explicit memory coalescing, are found to be critical to achieving good performance of this algorithm in both environments. The fully optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading. This is 2x faster than the fully optimized GPU version. Comment: 17 pages, 17 figures

arXiv.org e-Print Archive

Crossref

A Library for Pattern-based Sparse Matrix Vector Multiply

Author: Godmar Back
Mehmet Belgin
Calvin Ribbens
Publication date: 01/01/2009

Pattern-based Representation (PBR) is a novel approach to improving the performance of Sparse Matrix-Vector Multiply (SMVM) numerical kernels. Motivated by our observation that many matrices can be divided into blocks that share a small number of distinct patterns, we generate custom multiplication kernels for frequently recurring block patterns. The resulting reduction in index overhead significantly reduces memory bandwidth requirements and improves performance. Unlike existing methods, PBR requires neither detection of dense blocks nor zero filling, making it particularly advantageous for matrices that lack dense nonzero concentrations. SMVM kernels for PBR can benefit from explicit prefetching and vectorization, and are amenable to parallelization. The analysis and format conversion to PBR are implemented as a library, making it suitable for applications that generate matrices dynamically at runtime. We present sequential and parallel performance results for PBR on two current multicore architectures, which show that PBR outperforms available alternatives for the matrices to which it is applicable, and that the analysis and conversion overhead is amortized in realistic application scenarios.
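
The idea in this abstract can be illustrated with a minimal Python sketch. It is not the library's actual implementation, which is not described here. The block size of 2 and the names `to_pbr`, `make_kernel`, `pbr_spmv`, and `dense_spmv` are all hypothetical, and Python closures stand in for the generated kernels, which a real implementation would presumably emit as compiled code. The sketch groups blocks by their nonzero pattern, stores only the nonzero values plus one block coordinate per block (no zero filling, no per-element column index), and runs one specialized routine per pattern.

```python
from collections import defaultdict

BLOCK = 2  # hypothetical block size; the abstract does not specify one


def to_pbr(dense):
    """Group the nonzeros of `dense` by the occupancy pattern of each block."""
    n_rows, n_cols = len(dense), len(dense[0])
    groups = defaultdict(list)  # pattern -> [(block_row, block_col, values)]
    for br in range(0, n_rows, BLOCK):
        for bc in range(0, n_cols, BLOCK):
            pattern, values = [], []
            for i in range(BLOCK):
                for j in range(BLOCK):
                    r, c = br + i, bc + j
                    if r < n_rows and c < n_cols and dense[r][c] != 0:
                        pattern.append((i, j))      # in-block offset
                        values.append(dense[r][c])  # nonzero only
            if pattern:  # all-zero blocks are not stored
                groups[tuple(pattern)].append((br, bc, values))
    return groups


def make_kernel(pattern):
    """Build a multiply routine specialized to one pattern.

    The in-block offsets are fixed per kernel, so they are stored once
    per pattern instead of once per nonzero.
    """
    def kernel(blocks, x, y):
        for br, bc, values in blocks:
            for (i, j), v in zip(pattern, values):
                y[br + i] += v * x[bc + j]
    return kernel


def pbr_spmv(groups, x, n_rows):
    """Compute y = A x by running one specialized kernel per pattern."""
    y = [0.0] * n_rows
    for pattern, blocks in groups.items():
        make_kernel(pattern)(blocks, x, y)
    return y


def dense_spmv(dense, x):
    """Reference dense multiply, used only to check the sketch."""
    return [sum(a * b for a, b in zip(row, x)) for row in dense]


# Illustrative matrix: three of its four blocks share the diagonal pattern.
A = [
    [1, 0, 2, 0],
    [0, 3, 0, 4],
    [5, 0, 0, 0],
    [0, 6, 7, 0],
]
x = [1, 2, 3, 4]
groups = to_pbr(A)
y = pbr_spmv(groups, x, len(A))
```

In this toy matrix the four nonempty blocks collapse to two distinct patterns, so only two kernels are needed. The savings the abstract describes would come from the same effect at scale, when a few patterns cover most blocks of a large matrix.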

Computer Science Technical Reports @Virginia Tech