164 research outputs found
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices leads to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
end, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup are required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.
Comment: 10 pages, 7 figures, ICPP 201
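For orientation, the baseline CSR SpMV kernel that such optimizers start from can be sketched as follows. This is a minimal pure-Python illustration, not the paper's optimizer and not the Intel MKL kernel; the names `csr_spmv`, `row_ptr`, `col_idx`, and `values` are assumptions chosen for this example.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr has one entry per row plus a final sentinel; the nonzeros of
    row i are values[row_ptr[i]:row_ptr[i + 1]], with matching column
    indices in col_idx. (Hypothetical names, for illustration only.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access into x: the irregular part of the kernel.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# 3x3 example matrix: [[4, 0, 1], [0, 0, 0], [2, 3, 0]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 0, 1]
values = [4.0, 1.0, 2.0, 3.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The empty middle row and the indirect reads of `x` hint at why row-length imbalance and irregular memory access are typical sources of the bottlenecks the abstract mentions.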
Portable performance on heterogeneous architectures
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine.
To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997)
Doctor of Philosophy
dissertation
Memory access irregularities are a major bottleneck for bandwidth limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory access. Noncontiguous accesses within a parallel group of threads working in lock step may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which requires runtime information to be evaluated. Compile time methods for evaluating parallelism, such as static dependence graphs, are not capable of evaluating irregular algorithms. The goals of this dissertation are to study irregularities within the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that improve control flow and memory access irregularity within those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching. This novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have significant impact on performance in vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation. Significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices.
Dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows for dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
We introduce the Bayesian Compiler Optimization framework (BaCO), a general
purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO
provides the flexibility needed to handle the requirements of modern autotuning
tasks. Particularly, it deals with permutation, ordered, and continuous
parameter types along with both known and unknown parameter constraints. To
reason about these parameter types and efficiently deliver high-quality code,
BaCO uses Bayesian optimization algorithms specialized towards the autotuning
domain. We demonstrate BaCO's effectiveness on three modern compiler systems:
TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For
these domains, BaCO outperforms current state-of-the-art autotuners by
delivering on average 1.36x-1.56x faster code with a tiny search budget, and
BaCO is able to reach expert-level performance 2.9x-3.9x faster.
A Framework for Automated Generation of Specialized Function Variants
Efficient large-scale scientific computing requires efficient code, yet optimizing code to render it efficient simultaneously renders the code less readable, less maintainable, less portable, and requires detailed knowledge of low-level computer architecture, which the developers of scientific applications may lack. The necessary knowledge is subject to change over time as new architectures, such as GPGPU architectures like CUDA, which require very different optimizations than CPU-targeted code, become more prominent. The development of scientific cloud computing means that developers may not even know what machine their code will be running on when they are developing it.
This work takes steps towards automating the generation of code variants which are automatically optimized for both execution environment and input dataset. We demonstrate that augmenting an autotuning framework with a performance database which captures metadata about environment and input and performing decision tree learning over that data can help more fully automate the process of enhancing software performance.
An Efficient Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats
Tensors, linear-algebraic extensions of matrices in arbitrary dimensions, have numerous applications in computer science and computational science. Many tensors are sparse, containing more than 90% zero entries. Efficient algorithms can leverage sparsity to do less work, but the irregular locations of the nonzero entries pose challenges to performance engineers. Many tensor operations such as tensor-vector multiplications can be sped up substantially by breaking the tensor into equally sized blocks (only storing blocks which contain nonzeros) and performing operations in each block using carefully tuned code. However, selecting the best block size is computationally challenging. Previously, Vuduc et al. defined the fill of a sparse tensor to be the number of stored entries in the blocked format divided by the number of nonzero entries, and showed that the fill can be used as an effective heuristic to choose a good block size. However, they gave no accuracy bounds for their method for estimating the fill, and it is vulnerable to adversarial examples. In this paper, we present a sampling-based method for finding a (1 + epsilon)-approximation to the fill of an order N tensor for all block sizes less than B, with probability at least 1 - delta, using O(B^(2N) log(B^N / delta) / epsilon^2) samples for each block size. We introduce an efficient routine to sample for all B^N block sizes at once in O(N B^N) time. We extend our concentration bounds to a more efficient bound based on sampling without replacement, using the recent Hoeffding-Serfling inequality. We then implement our algorithm and compare our scheme to that of Vuduc, as implemented in the Optimized Sparse Kernel Interface (OSKI) library. We find that our algorithm provides faster estimates of the fill at all accuracy levels, providing evidence that this is both a theoretical and practical improvement. Our code is available under the BSD 3-clause license at https://github.com/peterahrens/FillEstimation
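To make the definition of fill concrete, the sketch below computes the exact fill of a sparse matrix (the order-2 case) for one block size. It is an illustration of the definition only, not the sampling-based estimator from the paper and not the OSKI routine; the function name `exact_fill`, the coordinate-list input, and the assumption that blocks are aligned to a grid starting at index (0, 0) are choices made for this example.

```python
def exact_fill(coords, block_rows, block_cols):
    """Exact fill for one block size.

    coords: iterable of distinct (row, col) positions of nonzero entries.
    Fill = (entries stored in the blocked format) / (number of nonzeros),
    where every block containing at least one nonzero is stored in full.
    Assumes blocks are aligned to a grid anchored at (0, 0).
    """
    coords = list(coords)
    if not coords:
        raise ValueError("fill is undefined for a matrix with no nonzeros")
    # A block is identified by the grid cell its entries fall into.
    nonempty_blocks = {(i // block_rows, j // block_cols) for i, j in coords}
    stored_entries = len(nonempty_blocks) * block_rows * block_cols
    return stored_entries / len(coords)


# Four nonzeros: three clustered in the top-left corner, one isolated.
coords = [(0, 0), (0, 1), (1, 0), (3, 3)]
fill_1x1 = exact_fill(coords, 1, 1)  # no padding zeros are stored
fill_2x2 = exact_fill(coords, 2, 2)  # two 2x2 blocks store 8 entries
```

Computing this exactly for every candidate block size requires a pass over all nonzeros per size, which is the cost that a sampling-based estimate aims to avoid.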
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores have attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO)
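As background for the term "segmented sum", the sketch below shows the plain, non-speculative formulation of CSR SpMV as an elementwise multiply followed by a sum over segments. It runs sequentially in pure Python and does not reproduce the speculative GPU step or the CPU re-arrangement step described in the abstract; the function and variable names are assumptions made for this example.

```python
def segmented_sum_spmv(row_ptr, col_idx, values, x):
    """CSR SpMV written as multiply-then-segmented-sum (sequential sketch)."""
    # Step 1: one independent multiply per nonzero. On parallel hardware
    # this step is trivially data-parallel.
    products = [v * x[j] for v, j in zip(values, col_idx)]
    # Step 2: sum each segment of `products`. Consecutive entries of
    # row_ptr delimit the segment belonging to one row; an empty row
    # yields an empty segment and therefore a sum of 0.0.
    y = []
    for start, end in zip(row_ptr[:-1], row_ptr[1:]):
        y.append(sum(products[start:end], 0.0))
    return y


# 3x3 example matrix: [[4, 0, 1], [0, 0, 0], [2, 3, 0]]
seg_row_ptr = [0, 2, 2, 4]
seg_col_idx = [0, 2, 0, 1]
seg_values = [4.0, 1.0, 2.0, 3.0]
seg_y = segmented_sum_spmv(seg_row_ptr, seg_col_idx, seg_values, [1.0, 2.0, 3.0])
```

Splitting the work per nonzero rather than per row balances the load across threads regardless of row length, at the price of needing the segment boundaries during the reduction.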
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many
sparse solvers. We investigate performance properties of spMVM with matrices of
various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded
jagged diagonals storage" (pJDS) format is proposed which may substantially
reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our
test scenarios the pJDS format cuts the overall spMVM memory footprint on the
GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance.
Using a suitable performance model we identify performance bottlenecks on the
node level that invalidate some types of matrix structures for efficient
multi-GPGPU parallelization. For appropriate sparsity patterns we extend
previous work on distributed-memory parallel spMVM to demonstrate a scalable
hybrid MPI-GPGPU code, achieving efficient overlap of communication and
computation.
Comment: 10 pages, 5 figures. Added reference to other recent sparse matrix
format